[
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807478#comment-16807478
]
Qian Zhang commented on MESOS-9501:
-----------------------------------
To fix the above issue, we could establish a TCP connection to the
executor's libprocess address, just as we do in the Docker containerizer
(see
[here|https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054]
for details).
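
As a rough illustration of that kind of reachability check, here is a minimal
Python sketch. It is not the Mesos implementation (which is C++ on top of
libprocess); the function name `executor_reachable` and the default timeout are
hypothetical. The idea is that if nothing accepts a TCP connection at the
checkpointed executor address, the executor can be treated as gone.

```python
import socket


def executor_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for illustration; host and port stand in for the
    executor's checkpointed libprocess address.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that a successful connect only shows that something is listening at that
address; it does not by itself prove that the listener is the original executor.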
> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -------------------------------------------------------------------------
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.5.1, 1.6.1, 1.7.0
> Reporter: Meng Zhu
> Assignee: Qian Zhang
> Priority: Critical
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid.
> If, after the agent host reboots, a new process is spawned with the same pid,
> the agent will wait for the wrong process. This wait could block until that
> unrelated process exits; see `ReaperProcess::wait()`:
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This will block the executor termination as well as future task status updates
> (e.g. the master might still think the task is running).
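
To illustrate why a checkpointed pid is not a safe handle across a reboot, here
is a small Python sketch, assuming a POSIX system. It is not Mesos code, and
`pid_exists` is a hypothetical helper. Signal 0 performs only the existence and
permission checks, so it answers "is some process using this pid?" and cannot
tell the original executor apart from an unrelated process that was later
assigned the same number.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if some process currently has this pid (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: nothing is delivered, only error checking
    except ProcessLookupError:
        return False  # no process with this pid
    except PermissionError:
        return True  # a process exists but belongs to another user
    return True
```

A check like this returning True for a pid read from a pre-reboot checkpoint
says nothing about which process now owns that pid.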
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)