[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435767#comment-16435767 ]
Andrei Budnik commented on MESOS-8574:
--------------------------------------

[~vishant.si...@gmail.com] The issue was a hanging Docker CLI, which the built-in Docker executor uses. We've solved it by:
1) automatically retrying `docker inspect` every `DOCKER_INSPECT_TIMEOUT` seconds in [the Docker executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L220-L251] and in the [Docker containerizer|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/slave/containerizer/docker.cpp#L1717-L1753];
2) making the `killTask` command retryable for [the Docker executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L543-L565]. If a scheduler (e.g. Marathon) retries `killTask`, we abort the previous (possibly hanging) `docker stop` command and start a new `docker stop`.

> do we need docker inspect just for the pid of the newly launched container?
> wondering if there is an alternative to 'docker inspect' if its just pid.

We call `docker inspect` not only to get the pid of a container. We have to wait until the Docker daemon marks the container as `RUNNING` before sending a `TASK_RUNNING` status update to a scheduler; otherwise we [might detect a container termination|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L253-L257] and send a terminal status update (e.g. `TASK_FINISHED`) before sending `TASK_RUNNING`. So we wait for `docker inspect` to guarantee the correct order of task status updates.
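
The retry idea in the comment above (kill a hanging CLI call after a timeout and issue it again) can be sketched as follows. This is a minimal illustration in Python, not the Mesos code, which is C++ and lives at the links above. The helper name `run_with_retry`, its parameters, and the bounded `max_attempts` are assumptions made for the sketch; the comment does not say whether Mesos caps the number of retries.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retry(
    argv: Sequence[str],
    timeout_secs: float,
    max_attempts: int,
    accept: Callable[[str], bool] = lambda stdout: True,
    retry_delay_secs: float = 0.0,
) -> Optional[str]:
    """Run argv, re-running it whenever an attempt hangs or is rejected.

    Hypothetical helper: timeout_secs plays the role that
    DOCKER_INSPECT_TIMEOUT plays in the comment above.
    Returns the accepted stdout, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        if attempt > 0 and retry_delay_secs > 0:
            time.sleep(retry_delay_secs)
        try:
            # subprocess.run kills and reaps the child when the timeout
            # expires, then raises TimeoutExpired.
            result = subprocess.run(
                list(argv), capture_output=True, text=True, timeout=timeout_secs
            )
        except subprocess.TimeoutExpired:
            continue  # the attempt hung: abandon it and start a new one
        if result.returncode == 0 and accept(result.stdout):
            return result.stdout
    return None


# Hypothetical usage (requires a Docker CLI; the container name is made up):
#   run_with_retry(
#       ["docker", "inspect", "--format", "{{.State.Running}}", "my-container"],
#       timeout_secs=5.0,
#       max_attempts=10,
#       accept=lambda out: out.strip() == "true",
#   )
```

The `accept` callback stands in for the "wait until the daemon marks the container as `RUNNING`" condition: an attempt that returns promptly but reports a container that is not yet running is treated like a failed attempt and retried.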

> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
>                 Key: MESOS-8574
>                 URL: https://issues.apache.org/jira/browse/MESOS-8574
>             Project: Mesos
>          Issue Type: Improvement
>          Components: docker, executor
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: mesosphere
>             Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)