[
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435767#comment-16435767
]
Andrei Budnik commented on MESOS-8574:
--------------------------------------
[[email protected]] The issue was caused by a hanging Docker CLI, which is used
in the built-in Docker executor.
We've solved it by:
1) automatically retrying `docker inspect` every `DOCKER_INSPECT_TIMEOUT`
seconds in [the Docker
executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L220-L251]
and in the [Docker
containerizer|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/slave/containerizer/docker.cpp#L1717-L1753];
2) making the `killTask` command retryable for [the Docker
executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L543-L565].
If a scheduler (e.g. Marathon) retries `killTask`, we abort the previous
(possibly hanging) `docker stop` command and start a new one.
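
For illustration only, the two changes above can be sketched in Python. This is not the Mesos code (which is C++); `INSPECT_TIMEOUT_SECS`, `run_with_retry`, and `StopRetrier` are hypothetical names, and the timeout value is an assumption rather than the real `DOCKER_INSPECT_TIMEOUT`.

```python
import subprocess

# Assumed value for illustration; not the real DOCKER_INSPECT_TIMEOUT.
INSPECT_TIMEOUT_SECS = 5.0


def run_with_retry(cmd, timeout=INSPECT_TIMEOUT_SECS, max_attempts=None):
    """Run `cmd`; if it does not finish within `timeout` seconds, discard it
    and start a fresh attempt. Retries forever when max_attempts is None."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            # On timeout, subprocess.run kills the child process and then
            # raises TimeoutExpired, so a hung CLI call cannot block us.
            return subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue
    raise TimeoutError(f"{cmd!r} did not finish in {attempt} attempts")


class StopRetrier:
    """Each kill request aborts the previous (possibly hanging) stop
    command and starts a new one."""

    def __init__(self, stop_cmd):
        self._stop_cmd = stop_cmd
        self._current = None

    def kill_task(self):
        if self._current is not None and self._current.poll() is None:
            # The previous stop command is still running: abort it.
            self._current.kill()
            self._current.wait()
        self._current = subprocess.Popen(self._stop_cmd)
        return self._current
```

With these helpers, something like `run_with_retry(["docker", "inspect", name])` would stand in for 1), and calling `StopRetrier(["docker", "stop", name]).kill_task()` once per `killTask` request would stand in for 2), where `name` is a placeholder for the container name.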
> do we need docker inspect just for the pid of the newly launched container?
> wondering if there is an alternative to 'docker inspect' if its just pid.
We call `docker inspect` not only to get the pid of a container.
We have to wait until the Docker daemon marks the container as `RUNNING` before
sending a `TASK_RUNNING` status update to the scheduler;
otherwise we [might detect a container
termination|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L253-L257]
and send a terminal status update (e.g. `TASK_FINISHED`) before sending
`TASK_RUNNING`. So we wait for `docker inspect` to guarantee the correct order of
task status updates.
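
The ordering constraint can be sketched as follows. This is a simplified Python model, not the executor's actual code: `StatusReporter` and its callbacks are hypothetical names, and mapping a non-zero exit code to `TASK_FAILED` is a simplifying assumption.

```python
import threading


class StatusReporter:
    """Holds back the terminal status update until TASK_RUNNING was sent."""

    def __init__(self, send_update):
        self._send = send_update
        self._running_sent = threading.Event()

    def on_inspect_running(self):
        # Called once `docker inspect` reports the container as running.
        self._send("TASK_RUNNING")
        self._running_sent.set()

    def on_container_exit(self, exit_code):
        # Even if termination is detected first, wait for TASK_RUNNING
        # so the scheduler sees the updates in the correct order.
        self._running_sent.wait()
        self._send("TASK_FINISHED" if exit_code == 0 else "TASK_FAILED")
```

In this model, a container that exits before `docker inspect` returns still produces `TASK_RUNNING` followed by the terminal update, never the reverse.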
> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.