Meng Zhu created MESOS-9501:
-------------------------------

             Summary: Mesos executor fails to terminate and gets stuck after agent reboot.
                 Key: MESOS-9501
                 URL: https://issues.apache.org/jira/browse/MESOS-9501
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.7.0, 1.6.1, 1.5.1
            Reporter: Meng Zhu


When an agent host reboots, all of its containers are gone but the agent will 
still try to recover from its checkpointed state after reboot.

The agent will soon discover that all the cgroup hierarchies are gone and 
assume (correctly) that the containers are destroyed.

However, when trying to terminate the executor, the agent will first try to 
wait for the exit status of its container:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631

The agent does so by calling `waitpid` on the checkpointed child process pid.
If, after the agent host reboot, a new process with the same pid gets spawned,
then the parent will wait for the wrong child process. This could get stuck
until the wrongly waited-for process exits; see `ReaperProcess::wait()`:
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
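
The pid-reuse hazard described above can be illustrated with a minimal Python
sketch. This is not the Mesos or libprocess code; the function name
`classify_pid` and its return labels are hypothetical. It assumes a reaper that
first tries a non-blocking `waitpid` and, when the pid is not one of its
children, falls back to checking whether any process with that pid exists.

```python
import os


def classify_pid(pid):
    """Classify a pid the way a polling reaper might (hypothetical helper).

    Returns one of:
      "reaped"          pid was our child and its exit status was collected
      "running-child"   pid is our child and has not exited yet
      "alive-non-child" pid is not our child, but some process has that pid
      "gone"            no process currently has that pid
    """
    try:
        reaped, _status = os.waitpid(pid, os.WNOHANG)
        return "reaped" if reaped == pid else "running-child"
    except ChildProcessError:
        # ECHILD: pid is not a child of this process, so no exit status
        # can be collected for it. After a host reboot, every checkpointed
        # pid falls into this case.
        pass

    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only probes existence
    except ProcessLookupError:
        return "gone"
    except PermissionError:
        # The process exists but belongs to another user.
        return "alive-non-child"
    return "alive-non-child"
```

For example, `classify_pid(os.getpid())` returns "alive-non-child": the calling
process is not its own child, yet the pid exists. An unrelated process that
reused a checkpointed pid after reboot would look the same, so a reaper that
keeps polling until the result becomes "gone" would wait for as long as that
unrelated process lives.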

This will block the executor termination as well as future task status updates
(e.g. the master might still think the task is running).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
