[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588976#comment-16588976
 ] 

Qian Zhang commented on MESOS-8568:
-----------------------------------

Reproduce steps:

1. To simulate the failure of launching nested container via health check, 
change `CgroupsIsolatorProcess::isolate` a bit:
{code:java}
Future<Nothing> CgroupsIsolatorProcess::isolate(
     const ContainerID& containerId,
     pid_t pid)
 {
+  if (strings::startsWith(containerId.value(), "check")) {
+    return Failure("==========Fake error==========");
+  }
+
{code}
2. Start Mesos master and agent.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/home/qzhang/opt/mesos

$ sudo ./bin/mesos-slave.sh --master=10.0.49.2:5050 --port=36251 
--work_dir=/home/qzhang/opt/mesos 
--isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem
{code}
3. Launch a nested container with check enabled.
{code:java}
$ cat task_group_health_check.json
{
  "tasks":[
    {
      "name" : "test",
      "task_id" : {"value" : "test"},
      "agent_id": {"value" : ""},
      "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
      ],
      "command": {
        "value": "touch aaa && sleep 1000"
      },
      "check": {
        "type": "COMMAND",
        "command": {
          "command": {
           "value": "ls aaa  > /dev/null"
          }
        },
        "delay_seconds": 5,
        "interval_seconds": 3
      }
    }
  ]
}

$ src/mesos-execute --master=10.0.49.2:5050 
--task_group=file:///home/qzhang/workspace/config/task_group_health_check.json{code}
5. After a few minutes, there will be a lot of check container's sandbox 
directories not removed.

 
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/c355abce-0088-4196-8376-d54c9963abdd-S0/frameworks/c355abce-0088-4196-8376-d54c9963abdd-0000/executors/default-executor/runs/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/containers/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/containers
 | grep check | wc -l
119
{code}
And in the default executor's stderr, we see a lot of warning messages

 

 
{code:java}
...
W0822 07:37:45.084581 19377 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==========Fake error==========) while launching COMMAND check 
for task 'test'
I0822 07:37:45.085053 19377 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.092411 19362 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
 used for the COMMAND check for task 'test'
I0822 07:37:48.093101 19362 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0822 07:37:48.130527 19373 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==========Fake error==========) while launching COMMAND check 
for task 'test'
W0822 07:37:51.099179 19360 checker_process.cpp:655] Received '500 Internal 
Server Error' (Nested container has not terminated yet) while removing the 
nested container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-5c7af0fc-ad73-4870-aba8-65a3fb4eae12'
 used for the COMMAND check for task 'test'
I0822 07:37:51.099799 19360 checker_process.cpp:457] COMMAND check for task 
'test' is not available
...{code}
 

So every time when the default executor called `REMOVE_NESTED_CONTAINER` to 
remove the previous check container, the call will fail with a 500 error. The 
reason that this call failed is the check container has not terminated yet 
(still in `DESTROYING` state), the agent log below also proved this.
{code:java}
I0822 07:37:45.051453 19063 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.058904 19088 switchboard.cpp:316] Container logger module 
finished preparing container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586;
 IOSwitchboard server is required
I0822 07:37:45.065444 19088 systemd.cpp:98] Assigned child process '19410' to 
'mesos_executors.slice'
I0822 07:37:45.065724 19088 switchboard.cpp:604] Created I/O switchboard server 
(pid: 19410) listening on socket file 
'/tmp/mesos-io-switchboard-048e2be0-4a2b-4c00-a846-0e8137507a85' for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.069316 19062 linux_launcher.cpp:492] Launching nested container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 and cloning with namespaces 
W0822 07:37:45.082442 19079 http.cpp:2635] Failed to launch container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586:
 Collect failed: ==========Fake error==========
I0822 07:37:45.084975 19068 linux_launcher.cpp:580] Asked to destroy container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.085175 19068 linux_launcher.cpp:622] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:45.086316 19081 cgroups.cpp:2838] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.086796 19086 cgroups.cpp:1229] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586
 after 344064ns
I0822 07:37:45.088006 19081 cgroups.cpp:2856] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586
I0822 07:37:45.088351 19107 cgroups.cpp:1258] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586
 after 237056ns
I0822 07:37:45.130489 19102 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/ab8d9ad1-e85c-472a-8608-a059a3e5cdf4/mesos/d66f9d77-9a69-41dd-9a70-dffdec8b2fba/mesos/check-63cf1b95-1013-4859-ab09-bb913382c586'
I0822 07:37:48.089303 19103 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:37312
I0822 07:37:48.089993 19103 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER 
call for container 
'ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586'
...
I0822 07:37:50.132818 19111 switchboard.cpp:807] Sending SIGTERM to I/O 
switchboard server (pid: 19410) since container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 is being destroyed
I0822 07:37:50.159760 19063 switchboard.cpp:913] I/O switchboard server process 
for container 
ab8d9ad1-e85c-472a-8608-a059a3e5cdf4.d66f9d77-9a69-41dd-9a70-dffdec8b2fba.check-63cf1b95-1013-4859-ab09-bb913382c586
 has terminated (status=0)
{code}
Agent started to destroy the check container at 07:37:45 but the check 
container was eventually destroyed at 07:37:50 because I/O switchboard server 
process was stilling running during this 5 seconds (it terminated at 07:37:50 
when `IOSwitchboard::cleanup` sent `SIGTERM` to it), and default executor 
called `REMOVE_NESTED_CONTAINER` at 07:37:48 while the check container was 
still in `DESTROYING` state, so the call failed and the check container's 
sandbox directory and runtime directory will not be removed.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to