[
https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15729768#comment-15729768
]
Joseph Wu commented on MESOS-6743:
----------------------------------
Just food for thought:
A timeout and retry for {{docker stop}} is definitely something we want to add
(1).
Suppose, however, that {{docker stop}} is completely and permanently broken. In
that case, it may be best for the executor to *kill the agent* (or somehow
trigger the agent's death). When the agent restarts, it will detect the
orphaned docker tasks (given {{--docker_kill_orphans}}) and attempt to kill
them. If that fails, the agent will fail to recover and start flapping
(restart, detect orphans, fail to kill, suicide, ...).
^ This is preferable to me, compared to (2) and (3).
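The timeout-and-retry part of (1) could look roughly like the sketch below. This is only an illustration, not Mesos code: the function name, parameters, and defaults are made up, and a real executor would drive {{docker stop}} through its own process abstractions rather than the Python {{subprocess}} module.

```python
import subprocess

def stop_with_retries(cmd, attempts=3, timeout=30.0):
    """Run a stop command (e.g. ["docker", "stop", <container-id>]) up to
    `attempts` times, aborting each try after `timeout` seconds.

    Returns True on the first zero exit status; False if every attempt
    fails or times out. A hung `docker stop` is treated the same as a
    failed one, so the caller never blocks forever.
    """
    for _ in range(attempts):
        try:
            if subprocess.run(cmd, timeout=timeout).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # the command hung; fall through and retry
    return False
```

If all attempts are exhausted, the executor would then escalate as described above, e.g. by triggering the agent's death so that orphan cleanup takes over on restart.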
> Docker executor hangs forever if `docker stop` fails.
> -----------------------------------------------------
>
> Key: MESOS-6743
> URL: https://issues.apache.org/jira/browse/MESOS-6743
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Affects Versions: 1.0.1, 1.1.0
> Reporter: Alexander Rukletsov
> Labels: mesosphere
>
> If {{docker stop}} finishes with an error status, the executor should catch
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill.
> However, in this case it is unclear what status updates we should send:
> {{TASK_KILLING}} for every kill retry? An extra update when we fail to kill
> a task? Or a specific reason set in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is
> killed or notify the framework and the operator that the container may still
> be running.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)