Avinash Sridharan created MESOS-6337:
----------------------------------------
Summary: Nested containers getting killed before network isolation
can be applied to them.
Key: MESOS-6337
URL: https://issues.apache.org/jira/browse/MESOS-6337
Project: Mesos
Issue Type: Bug
Components: containerization
Environment: Linux
Reporter: Avinash Sridharan
Assignee: Gilbert Song
Fix For: 1.1.0
Seeing this odd behavior in one of our clusters:
```
http.cpp:1948] Failed to launch nested container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e:
Collect failed: Failed to seed container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e:
Collect failed: Failed to setup hostname and network files: Failed to enter the
mount namespace of pid 21591: Pid 21591 does not exist
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 31531
containerizer.cpp:1931] Destroying container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in
ISOLATING state
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 31531
containerizer.cpp:2300] Container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has
exited
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 31534
systemd.cpp:96] Assigned child process '21591' to 'mesos_executors.slice'
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 21580
process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set
LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 21580
openssl.cpp:432] Will only verify peer certificate if presented!
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set
LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 21580
openssl.cpp:426] Will not verify peer certificate!
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 21580
openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory path
with LIBPROCESS_SSL_CA_DIR=<dirpath>
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 21580
openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before
InitGoogleLogging() is written to STDERR
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 21581
process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set
LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
```
The above log is in reverse chronological order, so please read it from the bottom up.
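For anyone who prefers to read the log top down, here is a minimal sketch that restores chronological order. It assumes every journal entry starts with a syslog-style prefix such as `Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]:` and that lines without the prefix are wrapped continuations of the entry above them. The function name `to_chronological` is made up for illustration.

```python
import re

# Assumption: an entry begins with "<Mon> <dd> <hh:mm:ss> <host> <unit>[<pid>]: ".
ENTRY_START = re.compile(
    r"^[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \S+ \S+\[\d+\]: "
)


def to_chronological(lines):
    """Group wrapped lines into entries, then reverse the order of the entries."""
    entries = []
    for line in lines:
        # Start a new entry at each syslog prefix; leading lines that lack a
        # prefix (as in the paste above) form the first entry.
        if ENTRY_START.match(line) or not entries:
            entries.append([line])
        else:
            entries[-1].append(line)
    entries.reverse()
    # Flatten back to lines, keeping each entry's own lines in reading order.
    return [line for entry in entries for line in entry]
```

Reversing whole entries rather than individual lines keeps wrapped messages intact. The sketch does not sort by timestamp, so it relies on the input being strictly newest-first.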
The relevant log is:
```
http.cpp:1948] Failed to launch nested container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e:
Collect failed: Failed to seed container
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e:
Collect failed: Failed to setup hostname and network files: Failed to enter the
mount namespace of pid 21591: Pid 21591 does not exist
```
It looks like the nested container failed to launch because the `isolate` call to
the `network/cni` isolator failed. It seems that by the time the isolator received
the `isolate` call, the process for the nested container had already exited, so the
isolator couldn't enter its mount namespace to set up the network files.
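To make the failure mode concrete: on Linux, a process's mount namespace is reachable through `/proc/<pid>/ns/mnt`, and that entry disappears once the process is gone. The sketch below is not the Mesos implementation; the helper names and the `proc_root` parameter are hypothetical and exist only so the check can be exercised against a fake directory tree.

```python
import os


def mount_namespace_path(pid, proc_root="/proc"):
    """Return the path of the mount namespace handle for `pid`."""
    return os.path.join(proc_root, str(pid), "ns", "mnt")


def can_enter_mount_namespace(pid, proc_root="/proc"):
    """Report whether the namespace handle for `pid` is currently present.

    This is only a point-in-time check: the process can still exit between
    this call and a later attempt to enter the namespace.
    """
    # lexists() avoids following the special symlink that /proc exposes here.
    return os.path.lexists(mount_namespace_path(pid, proc_root))
```

For the PID in the log, `can_enter_mount_namespace(21591)` would return `False` after the process has exited, which matches the "Pid 21591 does not exist" error. Because the check itself is racy, it helps with diagnosis but does not prevent the failure.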
The odd thing here is that the nested container should have been frozen at this
point, and hence not running, so I am not sure what killed it. My
suspicion falls on systemd, since I also see this log message:
```
Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 31532
systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice'
```
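One way to check which slice a process actually ended up in is to inspect `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. The following is a rough debugging sketch; the function names are made up, and any cgroup paths fed to it in examples are illustrative rather than taken from the affected agent.

```python
def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The cgroup path may itself contain ':', so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((hierarchy_id, controller_list, path))
    return entries


def in_slice(text, slice_name="mesos_executors.slice"):
    """Return True if any hierarchy places the process under `slice_name`."""
    return any(
        slice_name in path.split("/")
        for _, _, path in parse_cgroup_file(text)
    )
```

Reading `/proc/<pid>/cgroup` for a freshly launched nested container and passing the content to `in_slice` would show whether the process was moved into `mesos_executors.slice`. That alone does not establish that systemd killed it.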