[ https://issues.apache.org/jira/browse/MESOS-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone updated MESOS-6337:
------------------------------
    Fix Version/s:     (was: 1.1.0)

> Nested containers getting killed before network isolation can be applied to them.
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-6337
>                 URL: https://issues.apache.org/jira/browse/MESOS-6337
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Linux
>            Reporter: Avinash Sridharan
>            Assignee: Gilbert Song
>              Labels: mesosphere
>
> Seeing this odd behavior in one of our clusters:
> ```
> http.cpp:1948] Failed to launch nested container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to seed container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to setup hostname and network files: Failed to enter the mount namespace of pid 21591: Pid 21591 does not exist
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 31531 containerizer.cpp:1931] Destroying container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in ISOLATING state
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 31531 containerizer.cpp:2300] Container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has exited
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 31534 systemd.cpp:96] Assigned child process '21591' to 'mesos_executors.slice'
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 21580 openssl.cpp:432] Will only verify peer certificate if presented!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 21580 openssl.cpp:426] Will not verify peer certificate!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=<dirpath>
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before InitGoogleLogging() is written to STDERR
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> ```
> The above log is in reverse chronological order, so please read it bottom up.
> The relevant log is:
> ```
> http.cpp:1948] Failed to launch nested container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to seed container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to setup hostname and network files: Failed to enter the mount namespace of pid 21591: Pid 21591 does not exist
> ```
> Looks like the nested container failed to launch because the `isolate` call to the `network/cni` isolator failed. It seems that by the time the isolator received the `isolate` call, the process for the nested container (PID 21591) had already exited, so the isolator could not enter its mount namespace to set up the network files.
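
Because the excerpt above is in reverse chronological order, it can help to re-sort the entries by their glog timestamp before reading them. The following is a minimal sketch of one way to do that, not part of Mesos: the names `glog_time`, `chronological`, and `SAMPLE` are hypothetical, and the year is an assumption, since the glog prefix carries only month and day. The sample lines are copied from the report.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
GLOG_RE = re.compile(r"\b[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\b")


def glog_time(line, year=2016):
    """Return the glog timestamp in `line`, or None if there is none.

    The year is NOT in the log; 2016 is an assumed default.
    """
    match = GLOG_RE.search(line)
    if match is None:
        return None
    month_day, clock = match.groups()
    return datetime.strptime(f"{year}{month_day} {clock}", "%Y%m%d %H:%M:%S.%f")


def chronological(lines):
    """Return glog-stamped lines oldest first; unstamped lines are dropped."""
    stamped = [(glog_time(line), line) for line in lines]
    stamped = [pair for pair in stamped if pair[0] is not None]
    return [line for _, line in sorted(stamped, key=lambda pair: pair[0])]


PREFIX = "Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: "
CONTAINER = (
    "cb92634b-42b3-40f3-94f7-609f89a362bc."
    "46d884e4-d0eb-4572-be1d-24414df7cb2e"
)

# Three entries from the report, in the order they appear there (newest first).
SAMPLE = [
    PREFIX + "I1007 02:05:55.894485 31531 containerizer.cpp:1931] "
    "Destroying container " + CONTAINER + " in ISOLATING state",
    PREFIX + "I1007 02:05:55.894439 31531 containerizer.cpp:2300] "
    "Container " + CONTAINER + " has exited",
    PREFIX + "I1007 02:05:55.854456 31534 systemd.cpp:96] "
    "Assigned child process '21591' to 'mesos_executors.slice'",
]

for entry in chronological(SAMPLE):
    print(entry)
```

Sorted this way, the three sample entries read in the order they happened: the systemd slice assignment for PID 21591, then the "has exited" message, then the destroy in the ISOLATING state.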
> The odd thing here is that the nested container would have been frozen, and hence was not running, so not sure what killed the nested container. My suspicion falls on systemd, since I also see this log message:
> ```
> Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice'
> ```

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)