Meng Zhu created MESOS-9501:
-------------------------------
Summary: Mesos executor fails to terminate and gets stuck after
agent reboot.
Key: MESOS-9501
URL: https://issues.apache.org/jira/browse/MESOS-9501
Project: Mesos
Issue Type: Bug
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Meng Zhu
When an agent host reboots, all of its containers are gone but the agent will
still try to recover from its checkpointed state after reboot.
The agent will soon discover that all the cgroup hierarchies are gone and
assume (correctly) that the containers are destroyed.
However, when trying to terminate the executor, the agent will first try to
wait for the exit status of its container:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
The agent does so by calling `waitpid` on the checkpointed child process pid.
If, after the agent host reboots, a new process is spawned with the same pid,
the agent will wait on the wrong process. The wait can then remain stuck until
that unrelated process exits; see `ReaperProcess::wait()`:
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
This blocks the executor termination as well as future task status updates
(e.g. the master might still think the task is running).
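
One possible guard against the pid reuse described above is to trust a
checkpointed pid only if it was recorded during the current boot. The sketch
below is illustrative only and is not Mesos code: `CheckpointedExecutor`,
`read_boot_id` and `pid_is_trustworthy` are hypothetical names, and it assumes
a Linux host where the boot ID can be read from
`/proc/sys/kernel/random/boot_id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointedExecutor:
    # Hypothetical record; this is NOT the actual Mesos checkpoint layout.
    pid: int       # pid of the executor when it was checkpointed
    boot_id: str   # boot ID of the host at checkpoint time


def read_boot_id(path: str = "/proc/sys/kernel/random/boot_id") -> Optional[str]:
    """Return the current Linux boot ID, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def pid_is_trustworthy(record: CheckpointedExecutor,
                       current_boot_id: Optional[str]) -> bool:
    """Return True only if the pid was checkpointed during the current boot.

    After a reboot the kernel hands out pids from scratch, so a checkpointed
    pid may now belong to an unrelated process. If the current boot ID is
    unknown, this sketch conservatively treats the pid as untrustworthy.
    """
    if current_boot_id is None:
        return False
    return record.boot_id == current_boot_id


record = CheckpointedExecutor(pid=1234, boot_id="boot-a")
same_boot = pid_is_trustworthy(record, "boot-a")    # True: same boot
after_reboot = pid_is_trustworthy(record, "boot-b")  # False: host rebooted
```

With a check like this, a caller could skip waiting on the checkpointed pid
when the host has rebooted and treat the executor as already terminated.
Whether that fits the agent's actual recovery path is an assumption here.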
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)