[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964 ]
Greg Mann commented on MESOS-8574:
----------------------------------

Based on discussions offline today, we started to converge on the following approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to enforce a timeout, after which the Docker executor calls {{docker->stop}} in an attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a TASK_KILLING update is sent to the scheduler. This will allow the scheduler to retry the {{docker stop}} call by sending KILL calls. In such a case, we should discard any previously-made call to {{docker->stop}}.

> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
>                 Key: MESOS-8574
>                 URL: https://issues.apache.org/jira/browse/MESOS-8574
>             Project: Mesos
>          Issue Type: Improvement
>          Components: docker, executor
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
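The approach converged on in the comment above can be sketched roughly as follows. This is an illustrative sketch only, not Mesos code: it swaps in standard C++ {{std::async}} / {{std::future::wait_for}} as a stand-in for libprocess's {{Future::after()}}, and {{FakeDocker}}, {{SketchExecutor}}, {{inspectDelay}} and {{stopCalls}} are hypothetical names. The {{timeout}} argument plays the role of the executor registration timeout suggested in the comment. One caveat of the stand-in: a {{std::future}} obtained from {{std::async}} blocks in its destructor until the task finishes, and it cannot be discarded the way a libprocess {{Future}} can.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the Docker CLI wrapper; not the Mesos Docker class.
struct FakeDocker {
  std::chrono::milliseconds inspectDelay{0};  // simulated `docker inspect` latency
  int stopCalls = 0;                           // number of `docker stop` attempts

  // Simulates `docker inspect`; a long delay models an unresponsive daemon.
  std::string inspect() const {
    std::this_thread::sleep_for(inspectDelay);
    return "running";
  }

  // Simulates `docker stop`.
  void stop() { ++stopCalls; }
};

// Hypothetical executor illustrating the proposed control flow.
class SketchExecutor {
 public:
  explicit SketchExecutor(FakeDocker& docker) : docker_(docker) {}

  // Starts `docker inspect` and waits at most `timeout` for it, playing the
  // role of `.after(timeout, ...)` on the libprocess Future.
  void launch(std::chrono::milliseconds timeout) {
    // Guard against re-launching while an earlier inspect is still pending.
    assert(!inspect_.valid() && "launch() called while inspect is pending");
    inspect_ = std::async(std::launch::async, [this] { return docker_.inspect(); });

    if (inspect_.wait_for(timeout) == std::future_status::timeout) {
      // Do not send TASK_FAILED: the container may be healthy even though the
      // daemon is unresponsive. Try to stop it through the normal kill path.
      killTask();
    } else {
      inspect_.get();
      updates_.push_back("TASK_RUNNING");
    }
  }

  // Kill path shared with scheduler KILL calls: report TASK_KILLING, then
  // attempt `docker stop`. A retried KILL simply calls this again; libprocess
  // could discard() the earlier stop Future, which std::future cannot do.
  void killTask() {
    updates_.push_back("TASK_KILLING");
    docker_.stop();
  }

  const std::vector<std::string>& updates() const { return updates_; }

 private:
  FakeDocker& docker_;
  // Blocks in its destructor until inspect() returns (std::async semantics).
  std::future<std::string> inspect_;
  std::vector<std::string> updates_;
};
```

Routing the timeout through {{killTask()}} rather than reporting a terminal failure keeps the scheduler informed via TASK_KILLING and leaves it free to retry with further KILL calls, which matches the concern in the issue description about containers that are still running while the daemon is unresponsive.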