[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456573#comment-16456573 ]
Vishant Singh commented on MESOS-8574:
--------------------------------------
[~abudnik]
I'm not completely sure of the reason for the docker hang,
but it seems that docker has stale information about running containers.
The container gets killed as part of a task kill request from Marathon. As the
docker task kill involves SIGTERM first and then SIGKILL after a timeout,
the SIGKILL terminates the task but dockerd is not updated with this state.
This might be because SIGKILL cannot be caught by a signal handler, so nothing
in the killed process gets a chance to update the state information in docker.
After this, when a new task is launched on this host, docker inspect or
docker ps becomes unresponsive.
At this point I have monitoring in place for docker hangs, and the idea is to
restart docker if it is in a hung state.
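
A minimal sketch of what such a hang monitor could look like, not the actual
monitoring in use. It assumes a hang can be detected by running docker ps with
a timeout, and that the daemon is restarted with systemctl restart docker; the
function names, the 30 second timeout and the restart command are all
assumptions made for illustration.

```python
import subprocess


def command_hangs(cmd, timeout_s):
    """Return True if cmd does not finish within timeout_s seconds.

    A command that exits quickly, even with a non-zero status, is not
    treated as hung.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return True
    return False


def check_and_restart(
    probe_cmd=("docker", "ps"),
    restart_cmd=("systemctl", "restart", "docker"),  # assumed restart command
    timeout_s=30.0,  # assumed threshold for calling the daemon hung
    run=subprocess.run,
):
    """Probe the daemon and restart it if the probe hangs.

    Returns True if a restart was triggered.
    """
    if command_hangs(list(probe_cmd), timeout_s):
        run(list(restart_cmd), check=False)
        return True
    return False
```

check_and_restart() would be called periodically, for example from cron or a
systemd timer. Note that restarting the daemon may also affect containers that
are running fine on the host, depending on how docker is configured.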
> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.
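
The concern in the description can be illustrated with a small sketch. This is
not the executor's code and not necessarily how the issue was fixed; it only
shows one way to keep "the daemon did not answer" separate from "the container
is not running", so that a timeout alone never leads to a failure status. The
names ContainerState and inspect_state, and the 10 second timeout, are
hypothetical.

```python
import enum
import subprocess


class ContainerState(enum.Enum):
    RUNNING = "running"
    NOT_RUNNING = "not_running"
    UNKNOWN = "unknown"  # daemon hung or returned an error: no conclusion


def inspect_state(container, timeout_s=10.0, docker_cmd=("docker",)):
    """Classify a container using `docker inspect` with a timeout."""
    cmd = [*docker_cmd, "inspect", "--format", "{{.State.Running}}", container]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # An unresponsive daemon says nothing about the container itself.
        return ContainerState.UNKNOWN
    if result.returncode != 0:
        # Could be a missing container or a daemon error; stay conservative.
        return ContainerState.UNKNOWN
    if result.stdout.strip() == "true":
        return ContainerState.RUNNING
    return ContainerState.NOT_RUNNING
```

A caller following the reasoning above would only consider sending
TASK_FAILED or TASK_KILLED on NOT_RUNNING, and would retry or wait on UNKNOWN.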
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)