[ https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Rukletsov updated MESOS-6743:
---------------------------------------
Sprint: Mesosphere Sprint 60
Labels: mesosphere reliability (was: mesosphere)
> Docker executor hangs forever if `docker stop` fails.
> -----------------------------------------------------
>
> Key: MESOS-6743
> URL: https://issues.apache.org/jira/browse/MESOS-6743
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0
> Reporter: Alexander Rukletsov
> Priority: Critical
> Labels: mesosphere, reliability
>
> If {{docker stop}} finishes with an error status, the executor should catch
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark the task as {{killed}}. This will allow frameworks to retry the kill.
> However, in this case it is unclear what status updates we should send:
> {{TASK_KILLING}} for every kill retry? An extra update when we fail to kill
> a task? Or a specific reason set in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is
> killed or notify the framework and the operator that the container may still
> be running.
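
One way options 1 and 3 could be combined is sketched below. This is only an illustration in Python, not the Mesos docker executor code (which is C++): the function names, the retry limit, the timeout, and the returned status strings are all assumptions made for the example. The idea is to bound the number of {{docker stop}} attempts, escalate to {{docker kill}} once they are exhausted, and return an explicit "may still be running" result instead of waiting forever.

```python
import subprocess
from typing import Callable, Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,        # assumed limit, not a Mesos default
    timeout_secs: float = 30.0,   # assumed per-attempt timeout
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True once cmd exits with status 0, False if every attempt fails."""
    for _ in range(max_attempts):
        try:
            result = run(list(cmd), timeout=timeout_secs, capture_output=True)
        except (subprocess.TimeoutExpired, OSError):
            # A hung or unlaunchable command counts as a failed attempt.
            continue
        if result.returncode == 0:
            return True
    return False


def kill_task(container: str, run=subprocess.run) -> str:
    """Hypothetical policy: bounded stop retries, then kill, then report."""
    if run_with_retries(["docker", "stop", container], run=run):
        return "stopped"
    # Stop kept failing: escalate to a forced kill, tried once.
    if run_with_retries(["docker", "kill", container], max_attempts=1, run=run):
        return "killed"
    # Neither worked: the caller should notify the framework and the operator.
    return "may_still_be_running"
```

The {{run}} parameter exists only so the policy can be exercised without a Docker daemon. The open question from option 2, which status updates to send on each retry, is left out of this sketch on purpose.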