[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588976#comment-16588976 ]
Qian Zhang commented on MESOS-8568: ----------------------------------- Reproduce steps: 1. To simulate the failure of launching nested container via health check, change `CgroupsIsolatorProcess::isolate` a bit: {code:java} Future<Nothing> CgroupsIsolatorProcess::isolate( const ContainerID& containerId, pid_t pid) { + if (strings::startsWith(containerId.value(), "check")) { + return Failure("==========Fake error=========="); + } + {code} 2. Start Mesos master and agent. {code:java} $ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos $ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 --work_dir=/home/qzhang/opt/mesos --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem {code} 3. Launch a nested container with check enabled. {code:java} $ cat task_group_health_check.json { "tasks":[ { "name" : "test", "task_id" : {"value" : "test"}, "agent_id": {"value" : ""}, "resources": [ {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}}, {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}} ], "command": { "value": "touch aaa && sleep 1000" }, "check": { "type": "COMMAND", "command": { "command": { "value": "ls aaa > /dev/null" } }, "delay_seconds": 5, "interval_seconds": 3 } } ] } $ src/mesos-execute --master=10.0.49.2:5050 --task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code} 5. After a few minutes, there will be a lot of check container's sandbox directories not removed. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-0000/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers | grep check | wc -l 119 {code} And in the default executor's stderr, we see a lot of warning messages {code:java} ... W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==========Fake error==========) while launching COMMAND check for task 'test' I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' used for the COMMAND check for task 'test' I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 'test' is not available W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' (Collect failed: ==========Fake error==========) while launching COMMAND check for task 'test' W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal Server Error' (Nested container has not terminated yet) while removing the nested container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12' used for the COMMAND check for task 'test' I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 'test' is not available ...{code} So every time when the default executor called `REMOVE_NESTED_CONTAINER` to remove the previous check container, the call will fail with a 500 error. The reason that this call failed is the check container has not terminated yet (still in `DESTROYING` state), the agent log below also proved this. {code:java} I0822 07:37:45.051453 19063 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module finished preparing container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586; IOSwitchboard server is required I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 'mesos_executors.slice' I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server (pid: 19410) listening on socket file '/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 and cloning with namespaces W0822 07:37:45.082442 19079 http.cpp:2635] Failed to launch container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586: Collect failed: ==========Fake error========== I0822 07:37:45.084975 19068 linux_launcher.cpp:580] Asked to destroy container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.085175 19068 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:45.086316 19081 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.086796 19086 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586 after 344064ns I0822 07:37:45.088006 19081 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586 I0822 07:37:45.088351 19107 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586 after 237056ns I0822 07:37:45.130489 19102 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586' I0822 07:37:48.089303 19103 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:37312 I0822 07:37:48.089993 19103 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container 'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586' ... I0822 07:37:50.132818 19111 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server (pid: 19410) since container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 is being destroyed I0822 07:37:50.159760 19063 switchboard.cpp:913] I/O switchboard server process for container ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586 has terminated (status=0) {code} Agent started to destroy the check container at 07:37:45 but the check container was eventually destroyed at 07:37:50 because I/O switchboard server process was stilling running during this 5 seconds (it terminated at 07:37:50 when `IOSwitchboard::cleanup` sent `SIGTERM` to it), and default executor called `REMOVE_NESTED_CONTAINER` at 07:37:48 while the check container was still in `DESTROYING` state, so the call failed and the check container's sandbox directory and runtime directory will not be removed. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > ------------------------------------------------------------------------------------------ > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement > Reporter: Andrei Budnik > Assignee: Qian Zhang > Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)