[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

Greg Mann (JIRA) Tue, 13 Feb 2018 00:18:41 -0800

    [ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964 ]


Greg Mann edited comment on MESOS-8574 at 2/13/18 8:11 AM:
-----------------------------------------------------------

Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work 
needed to make this discard effective).
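
To make the timeout-then-stop flow above concrete, here is a minimal sketch in standard C++. It is not the executor's code and does not use libprocess: {{std::future::wait_for}} stands in for {{.after()}}, and the names {{awaitInspect}}, {{InspectOutcome}} and {{onTimeout}} are hypothetical, chosen only for illustration. The {{onTimeout}} callback represents the path that would go through {{killTask()}} and end in {{docker->stop}}.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <string>

// Hypothetical result of waiting on the initial inspect call.
enum class InspectOutcome { Completed, TimedOut };

// Waits up to `timeout` for `inspect` to become ready. If the timeout
// expires first, invokes `onTimeout` (a stand-in for the killTask() path,
// which would send TASK_KILLING and then attempt the stop).
// Assumption: this blocks the caller, whereas the libprocess version
// would register a callback and return immediately.
InspectOutcome awaitInspect(
    std::future<std::string>& inspect,
    std::chrono::milliseconds timeout,
    const std::function<void()>& onTimeout)
{
  assert(inspect.valid());

  if (inspect.wait_for(timeout) == std::future_status::ready) {
    return InspectOutcome::Completed;
  }

  // The inspect call is still pending. Do not report a terminal state
  // here; only try to stop the container, since it may be running fine
  // while the daemon is unresponsive.
  onTimeout();
  return InspectOutcome::TimedOut;
}
```

A hung {{docker inspect}} can be simulated with a {{std::promise<std::string>}} that is never fulfilled: calling {{awaitInspect}} on its future with a short timeout should return {{InspectOutcome::TimedOut}} and run the callback exactly once. A scheduler-driven retry would call the stop path again after discarding the earlier attempt, which this sketch does not model.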


was (Author: greggomann):
Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}}.

> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
>                 Key: MESOS-8574
>                 URL: https://issues.apache.org/jira/browse/MESOS-8574
>             Project: Mesos
>          Issue Type: Improvement
>          Components: docker, executor
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting to have the executor simply terminate itself after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
