[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235
 ] 

Qian Zhang commented on MESOS-8568:
-----------------------------------

I ran the exactly same reproduce steps with the above patch applied, and found 
this issue was gone, there is only one check container's sandbox directory at 
any time.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-0000/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers
 | grep check | wc -l
{code}
Here is the agent log, we can see `WAIT_NESTED_CONTAINER` was called before 
`REMOVE_NESTED_CONTAINER` was called.

 
{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module 
finished preparing container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18;
 IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 
'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server 
(pid: 34701) listening on socket file 
'/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 and cloning with namespaces 
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18:
 Collect failed: ==========Fake error==========
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O 
switchboard server (pid: 34701) since container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process 
for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42090
I0823 19:46:47.428035 32636 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-f9d26e4a-aafd-427c-ac96-f4ddf050a13e'
{code}
Here is the default executor's stderr, we do not see any failed 
`REMOVE_NESTED_CONTAINER` calls.
{code:java}
Marked '/' as rslave
I0823 19:46:34.180434 34636 executor.cpp:201] Version: 1.8.0
I0823 19:46:34.205943 34658 default_executor.cpp:204] Received SUBSCRIBED event
I0823 19:46:34.207974 34658 default_executor.cpp:208] Subscribed executor on 
core-dev
I0823 19:46:34.208364 34658 default_executor.cpp:204] Received LAUNCH_GROUP 
event
I0823 19:46:34.209259 34666 default_executor.cpp:428] Setting 
'MESOS_CONTAINER_IP' to: 10.0.49.2
I0823 19:46:34.222025 34633 default_executor.cpp:204] Received ACKNOWLEDGED 
event
I0823 19:46:34.263268 34632 default_executor.cpp:663] Finished launching tasks 
[ test ] in child containers [ 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349 ]
I0823 19:46:34.263308 34632 default_executor.cpp:687] Waiting on child 
containers of tasks [ test ]
I0823 19:46:34.264135 34642 default_executor.cpp:748] Waiting for child 
container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349 of 
task 'test'
I0823 19:46:34.309953 34657 default_executor.cpp:204] Received ACKNOWLEDGED 
event
W0823 19:46:39.304706 34669 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==========Fake error==========) while launching COMMAND check 
for task 'test'
I0823 19:46:44.415793 34625 checker_process.cpp:457] COMMAND check for task 
'test' is not available
W0823 19:46:47.460309 34624 checker_process.cpp:794] Received '400 Bad Request' 
(Collect failed: ==========Fake error==========) while launching COMMAND check 
for task 'test'
I0823 19:46:52.663522 34633 checker_process.cpp:457] COMMAND check for task 
'test' is not available{code}
 

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to