[ 
https://issues.apache.org/jira/browse/MESOS-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849177#comment-15849177
 ] 

Kevin Klues commented on MESOS-7050:
------------------------------------

This bug came up during the development of a test for the new health checks 
being developed by [~gkleiman] and [~alexr]. The test can be found here: 
https://reviews.apache.org/r/55901/

The new health checks rely on periodically launching nested container sessions 
via the agent, which increases the chance of them being destroyed at 
arbitrary times when the container they are execing into gets destroyed.

In the test where we observed this, the setup was:
1) Start the scheduler driver
2) Launch a task group with the default executor that includes a single long 
running task
3) Wait for the task to return a status of HEALTHY one time
4) Stop the scheduler driver without explicitly waiting for any of the tasks to 
complete
5) Exit the test

With this setup, all of the ASSERTS in the test itself passed, but the test 
failed because there were remaining processes once the test exited (after a 
timeout of 15 seconds):
{noformat}
../../src/tests/cluster.cpp:570: Failure
Failed to wait 15secs for wait
...
[  FAILED  ] HealthCheckTest.DefaultExecutorCmdHealthCheck (15341 ms)
[----------] 1 test from HealthCheckTest (15347 ms total)
[----------] Global test environment tear-down
../../src/tests/environment.cpp:836: Failure
Failed
Tests completed with child processes remaining:
-+- 33680 /Users/klueska/projects/mesos/build/src/.libs/mesos-tests --verbose
 |-+- 33728 /Users/klueska/projects/mesos/build/src/.libs/mesos-containerizer 
launch
 | \--- 33770 sleep 120
 \--- 33772 /Users/klueska/projects/mesos/build/src/.libs/mesos-io-switchboard 
--heartbeat_interval=30secs --help=false --socket_address=/tmp/mesos-io-switch
[==========] 1 test from 1 test case ran. (15370 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] HealthCheckTest.DefaultExecutorCmdHealthCheck
 1 FAILED TEST
{noformat}

Exiting the test should cause all containers to be destroyed and cleaned up, so 
I started digging through the logs to figure out why this wasn't happening. The 
most interesting section of the log is:
{noformat}
I0131 15:48:58.234776 528384 switchboard.cpp:601] Created I/O 
switchboard server (pid: 33772) listening on socket file 
'/tmp/mesos-io-switchboard-3491c69e-dbcf-4d13-b0f0-080f7516b2d6' for container 
416f480c-fd4f-496b-939d-7a85b42362d6.96251fd8-5c2a-4016-9308-3b7f15e24ba7.96251fd8-5c2a-4016-9308-3b7f15e24ba7-health-check
W0131 15:48:58.236217 4284416 http.cpp:2278] Failed to launch nested container 
416f480c-fd4f-496b-939d-7a85b42362d6.96251fd8-5c2a-4016-9308-3b7f15e24ba7.96251fd8-5c2a-4016-9308-3b7f15e24ba7-health-check:
 Container is being destroyed during preparing
I0131 15:49:03.227895 4284416 slave.cpp:5078] Killing executor 'default' of 
framework 65f6131d-5602-4f69-875b-fb84e8ad91c5-0000 (via HTTP)
E0131 15:49:03.231164 4820992 process.cpp:2419] Failed to shutdown socket with 
fd 12: Socket is not connected
E0131 15:49:03.231312 4820992 process.cpp:2419] Failed to shutdown socket with 
fd 19: Socket is not connected
E0131 15:49:03.231590 4820992 process.cpp:2419] Failed to shutdown socket with 
fd 9: Socket is not connected
E0131 15:49:03.231788 4820992 process.cpp:2419] Failed to shutdown socket with 
fd 7: Socket is not connected
I0131 15:49:03.237408 4820992 switchboard.cpp:782] Sending SIGTERM to I/O 
switchboard server (pid: 33772) since container 
416f480c-fd4f-496b-939d-7a85b42362d6.96251fd8-5c2a-4016-9308-3b7f15e24ba7.96251fd8-5c2a-4016-9308-3b7f15e24ba7-health-check
 is being destroyed
{noformat}

This shows that we are attempting to kill the {{default}} executor 
(which in turn will kill all of its children). However, the timing is such 
that an IOSwitchboard for container 
{{416f480c-fd4f-496b-939d-7a85b42362d6.96251fd8-5c2a-4016-9308-3b7f15e24ba7.96251fd8-5c2a-4016-9308-3b7f15e24ba7-health-check}}
 has already been created, but the container itself is being killed during the 
{{PREPARE}} phase. The IOSwitchboard is then sent the {{SIGTERM}} signal, 
but we never see it actually exit.

To debug this further, I added some output to the IOSwitchboard main function 
to print timestamps when it started, received the SIGTERM, and when it exited. 
What I observed was that the timestamps for starting the switchboard and 
sending the SIGTERM signal from the agent logs matched up nicely with the 
timestamps generated from the switchboard itself. However, the exit timestamp 
wasn't until 15 seconds later once the entire test was torn down.

This led me to believe that the IOSwitchboard was being blocked from exiting 
by the condition it relies on to exit gracefully. Namely, it requires the FDs 
it allocates to a container to be closed before it will terminate, so that it 
can flush any remaining output the container might be generating before 
exiting cleanly. We don't yet have a "failsafe" in place to force kill the 
switchboard if it waits in this state for too long. Even if we did, the 
timeout would likely be on the order of minutes, because something really bad 
would have to be going on for it to get stuck in this state (or there would 
have to be a bug, as there appears to be now).

Going back to the containerizer code, I noticed that Jie had a comment that we 
were leaking the IOSwitchboard FDs in cases where the container launch path 
failed (which is exactly the case being considered here -- remember the line 
from the agent log for "{{Container is being destroyed during preparing}}").

Now I knew why the IOSwitchboard wasn't exiting, but why was this blocking the 
container itself from being destroyed and causing a deadlock?

Well, that has to do with the way the destroy path works. When destroy 
is called, it loops through all of its isolators, calling {{cleanup()}} on 
each of them and waiting for all of their futures to be satisfied before fully 
destroying the container. When it hits the IOSwitchboard (which is an 
isolator), its cleanup function was blocked on the switchboard itself 
exiting. Since its FDs were leaked and the switchboard could never exit, the 
destroy path was blocked, resulting in the observed deadlock.

The proposed fix, then, is to make sure that the IOSwitchboard FDs are not 
leaked on any code path where the {{container->launch()}} call might fail.

> IOSwitchboard FDs leaked when containerizer launch fails -- leads to deadlock
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7050
>                 URL: https://issues.apache.org/jira/browse/MESOS-7050
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Kevin Klues
>            Assignee: Kevin Klues
>            Priority: Critical
>              Labels: debugging, mesosphere
>
> If the containerizer launch path fails before actually
> launching the container, the FDs allocated to the container by the
> IOSwitchboard isolator are leaked. This leads to deadlock in
> the destroy path because the IOSwitchboard does not shutdown until the
> FDs it allocates to the container have been closed. Since the
> switchboard doesn't shutdown, the future returned by its 'cleanup()'
> function is never satisfied. 
> We need a general purpose method for closing the IOSwitchboard FDs when 
> failing in the launch path.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
