[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235 ]
Qian Zhang commented on MESOS-8568:
-----------------------------------

I ran exactly the same reproduction steps with the above patch applied and found that the issue is gone: there is only one check container's sandbox directory at any time.

{code:java}
$ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-0000/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers | grep check | wc -l
{code}

Here is the agent log; it shows that `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER`.

{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18; IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701) listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 and cloning with namespaces
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18: Collect failed: ==========Fake error==========
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server (pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42090
I0823 19:46:47.428035 32636 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-f9d26e4a-aafd-427c-ac96-f4ddf050a13e'
{code}

Here is the default executor's stderr; it contains no failed `REMOVE_NESTED_CONTAINER` calls.
{code:java}
Marked '/' as rslave
I0823 19:46:34.180434 34636 executor.cpp:201] Version: 1.8.0
I0823 19:46:34.205943 34658 default_executor.cpp:204] Received SUBSCRIBED event
I0823 19:46:34.207974 34658 default_executor.cpp:208] Subscribed executor on core-dev
I0823 19:46:34.208364 34658 default_executor.cpp:204] Received LAUNCH_GROUP event
I0823 19:46:34.209259 34666 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.0.49.2
I0823 19:46:34.222025 34633 default_executor.cpp:204] Received ACKNOWLEDGED event
I0823 19:46:34.263268 34632 default_executor.cpp:663] Finished launching tasks [ test ] in child containers [ 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349 ]
I0823 19:46:34.263308 34632 default_executor.cpp:687] Waiting on child containers of tasks [ test ]
I0823 19:46:34.264135 34642 default_executor.cpp:748] Waiting for child container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349 of task 'test'
I0823 19:46:34.309953 34657 default_executor.cpp:204] Received ACKNOWLEDGED event
W0823 19:46:39.304706 34669 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:44.415793 34625 checker_process.cpp:457] COMMAND check for task 'test' is not available
W0823 19:46:47.460309 34624 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:52.663522 34633 checker_process.cpp:457] COMMAND check for task 'test' is not available
{code}

> Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
> Issue Type: Improvement
> Reporter: Andrei Budnik
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] for the container. Before launching a nested container for a subsequent check, the checker library [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] `REMOVE_NESTED_CONTAINER` to remove the previous nested container. Hence, the `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER`, which ensures that the nested container has terminated and can be removed and cleaned up.
>
> If the launch fails, the library [doesn't call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been launched, and the subsequent attempt to remove it without first calling `WAIT_NESTED_CONTAINER` leads to errors like:
>
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' used for the COMMAND check for task 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
>
> The checker library should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.
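
For illustration only, the ordering requested in the issue description can be sketched as a toy Python model. This is not the Mesos C++ checker code: `FakeAgent`, `run_check`, and the `fail` flag are hypothetical names invented for this sketch. It also assumes that a failed launch can still leave a running container behind, and its `wait_nested_container` simply marks the container as terminated, whereas the real call returns only once the container has exited.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    TERMINATED = "terminated"


class FakeAgent:
    """Hypothetical stand-in for the agent's nested-container calls."""

    def __init__(self):
        self.containers = {}  # container ID -> State
        self.calls = []       # ordered record of (call name, container ID)

    def launch_nested_container_session(self, container_id, fail=False):
        self.calls.append(("LAUNCH_NESTED_CONTAINER_SESSION", container_id))
        # Assumption: the container may exist even if the launch reports failure.
        self.containers[container_id] = State.RUNNING
        if fail:
            raise RuntimeError("400 Bad Request: launch failed")

    def wait_nested_container(self, container_id):
        self.calls.append(("WAIT_NESTED_CONTAINER", container_id))
        # Simplification: treat returning from the wait as "container has terminated".
        if container_id in self.containers:
            self.containers[container_id] = State.TERMINATED

    def remove_nested_container(self, container_id):
        self.calls.append(("REMOVE_NESTED_CONTAINER", container_id))
        if self.containers.get(container_id) is State.RUNNING:
            raise RuntimeError("500: Nested container has not terminated yet")
        self.containers.pop(container_id, None)


def run_check(agent, container_id, previous_id=None, fail=False):
    """Run one check and return the container ID to remove before the next one."""
    if previous_id is not None:
        agent.remove_nested_container(previous_id)
    try:
        agent.launch_nested_container_session(container_id, fail=fail)
    except RuntimeError:
        pass  # launch failed, but the wait below must still happen
    # Always wait, whether or not the launch succeeded.
    agent.wait_nested_container(container_id)
    return container_id


agent = FakeAgent()
previous = run_check(agent, "check-1", fail=True)          # failed launch
previous = run_check(agent, "check-2", previous_id=previous)
```

In this model the second check removes `check-1` without an error because the wait was issued even though the launch had failed. Skipping the wait on the failure path would make `remove_nested_container` raise the "has not terminated yet" error.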