[https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120104#comment-16120104]

Alexander Rukletsov commented on MESOS-6743:
--------------------------------------------

In case of an error, the docker daemon and the container itself might be fine
and operating normally; it may be the communication between Mesos and the
daemon that is broken. All {{docker stop}} failures are supposed to return a
[non-zero exit code|https://github.com/spf13/cobra/blob/9c28e4bbd74e5c3ed7aacbc552b2cab7cfdfe744/cobra/cmd/init.go#L187],
even though the docker docs [say
nothing|https://docs.docker.com/engine/reference/commandline/stop] about it. It
looks like we cannot reliably distinguish why the command fails: errors may
originate [in the
client|https://github.com/moby/moby/blob/e9cd2fef805c8182b719d489967fb4d1aa34eecd/client/request.go#L41]
or [in the
daemon|https://github.com/moby/moby/blob/77c9728847358a3ed3581d828fb0753017e1afd3/daemon/stop.go#L44].
Hence the container might or might not have received the signal (the logs we
have hint at the latter).
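
For illustration, here is a minimal sketch of what the executor can actually
observe: run {{docker stop}} as a subprocess and inspect its exit status. This
uses plain POSIX {{system()}} / {{WEXITSTATUS}} rather than the libprocess
subprocess API the executor really uses, and the container name is a made-up
placeholder.

{code}
#include <cstdlib>
#include <iostream>

#include <sys/wait.h>

int main()
{
  // Hypothetical container name; the real executor derives it from the task.
  const char* command = "docker stop -t 30 mesos-task-container";

  int status = std::system(command);

  if (status == -1) {
    std::cerr << "Failed to fork/exec 'docker stop'" << std::endl;
    return 1;
  }

  if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
    std::cout << "'docker stop' succeeded" << std::endl;
  } else {
    // A non-zero exit code only tells us the CLI command failed; the error
    // may have originated in the client or in the daemon, so we cannot tell
    // whether the container ever received the signal.
    std::cout << "'docker stop' failed, raw wait status: " << status << std::endl;
  }

  return 0;
}
{code}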

In case of hanging (or timing out) commands, the docker daemon is likely
malfunctioning, while the container might or might not be fine and operating
normally. Here is what we can do.

h3. Exit docker executor
+ This will trigger a terminal update for the task, allowing Marathon to start
a new instance (see the sketch after this list).
\- The container might still be OK and running, but unknown to both Mesos and 
Marathon, which might be a problem for some apps.
\- The container becomes orphaned and keeps consuming resources that are no
longer accounted for, until the next agent restart (if
{{--docker_kill_orphans}} is set).
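
A rough sketch of what this option could look like inside the executor,
assuming the v0 executor driver API ({{ExecutorDriver::sendStatusUpdate()}} and
{{ExecutorDriver::stop()}}); the handler name and the message text are made up.

{code}
#include <string>

#include <mesos/executor.hpp>
#include <mesos/mesos.hpp>

// Hypothetical handler invoked when `docker stop` returns a non-zero exit
// code: send a terminal update so that the framework can start a new
// instance, then tear the executor down.
void onDockerStopFailure(
    mesos::ExecutorDriver* driver,
    const mesos::TaskID& taskId,
    const std::string& error)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskId);
  status.set_state(mesos::TASK_FAILED);
  status.set_message("Failed to stop the docker container: " + error);

  driver->sendStatusUpdate(status);

  // NOTE: If the container is in fact still running, exiting here orphans it
  // until the next agent restart (with --docker_kill_orphans set).
  driver->stop();
}
{code}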

h3. Forcibly kill the task
Call {{os::killtree()}} or {{os::kill()}} on the container pid and then exit 
the executor.
+ This will trigger a terminal update for the task, allowing Marathon to start 
a new instance.
+ The container is not orphaned.
\- Might kill an irrelevant process due to a pid race.
\- The task’s kill policy might be violated, since the container is not given
enough time to terminate gracefully. This is particularly concerning if the
daemon and the container are operating normally; it makes more sense in the
case of a timeout, see e.g. [https://reviews.apache.org/r/44571/].

We can implement our own escalation logic without using docker commands,
similar to what we [do for the command
executor|https://github.com/apache/mesos/blob/85af46f93d5625006d01bdcf78bba9fa547b3313/src/launcher/executor.cpp#L850-L871]
(sketched below). However, this does not look right to me, since the docker
executor is supposed to rely on the docker CLI for task management.
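
For reference, a stripped-down sketch of such escalation using plain POSIX
calls instead of {{os::killtree()}} and libprocess timers; the grace period
handling is simplified and there is no protection against the pid race
mentioned above.

{code}
#include <cerrno>
#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>

#include <signal.h>
#include <sys/types.h>

// Send SIGTERM to `pid`, wait up to `gracePeriod`, then escalate to SIGKILL
// if the process is still around. This mirrors the escalation the command
// executor performs, but without process trees, cgroups, or libprocess.
void escalate(pid_t pid, std::chrono::seconds gracePeriod)
{
  if (::kill(pid, SIGTERM) == -1) {
    std::perror("kill(SIGTERM)");
    return;
  }

  const auto deadline = std::chrono::steady_clock::now() + gracePeriod;

  while (std::chrono::steady_clock::now() < deadline) {
    // `kill(pid, 0)` probes for existence without delivering a signal.
    if (::kill(pid, 0) == -1 && errno == ESRCH) {
      return; // The process terminated within the grace period.
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  std::cerr << "Grace period expired; escalating to SIGKILL" << std::endl;
  ::kill(pid, SIGKILL);
}
{code}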

h3. Retry
+ The container is not orphaned.
\- Until the task is actually killed, Marathon sees it as {{TASK_KILLING}} and
hence will not start other instances.
\- If retrying is not successful within some time (how long?) or number of
attempts (how many?), what shall we do?
\- If docker commands are hanging, we have to make sure they are terminated
properly (see the sketch below).
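
Here is a sketch of what bounded retries with a per-attempt timeout could look
like, using {{fork()}} / {{execlp()}} and polling with {{waitpid()}} instead of
the libprocess subprocess API; the attempt count, timeout, and container name
are arbitrary placeholders.

{code}
#include <chrono>
#include <iostream>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run `docker stop` once, killing the CLI process if it hangs longer than
// `timeout`. Returns true iff the command exited with status 0.
bool dockerStopOnce(const char* container, std::chrono::seconds timeout)
{
  pid_t pid = ::fork();
  if (pid == -1) {
    return false;
  }

  if (pid == 0) {
    // Child: exec the docker CLI.
    ::execlp("docker", "docker", "stop", container, (char*) nullptr);
    ::_exit(127); // exec failed.
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    int status = 0;
    if (::waitpid(pid, &status, WNOHANG) == pid) {
      return WIFEXITED(status) && WEXITSTATUS(status) == 0;
    }

    if (std::chrono::steady_clock::now() >= deadline) {
      // The command is hanging: make sure it is terminated before retrying.
      ::kill(pid, SIGKILL);
      ::waitpid(pid, &status, 0);
      return false;
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}

int main()
{
  // Placeholder values; the real executor would derive these from the task.
  const char* container = "mesos-task-container";
  const int attempts = 3;

  for (int i = 0; i < attempts; ++i) {
    if (dockerStopOnce(container, std::chrono::seconds(60))) {
      std::cout << "Container stopped" << std::endl;
      return 0;
    }
    std::cerr << "Attempt " << (i + 1) << " failed, retrying" << std::endl;
  }

  std::cerr << "Giving up after " << attempts << " attempts" << std::endl;
  return 1;
}
{code}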

h3. Let the framework retry
If {{docker stop}} fails, cancel the kill, maybe restart health checking, and
send the scheduler a {{TASK_RUNNING}} update.
+ The container is not orphaned.
+ A running healthy container will be treated as such, and no actions from
Marathon will be necessary (given that health checks are restarted).
\- {{TASK_RUNNING}} after {{TASK_KILLING}} is probably confusing for framework 
authors.
\- Might require changes in Marathon and other frameworks that rely on the
docker executor.
\- What if the kill was issued not by the failing health check, but by the 
framework? Do we need a mechanism to forcefully kill the container?

Since the docker executor is supposed to delegate all commands to the docker
daemon, this seems like the least surprising option (sketched below). If docker
is misbehaving on an agent, the docker executor does not try to work around it,
but relays the errors to the framework.
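
A rough sketch of this option, again assuming the v0 executor driver API;
restarting the health checker is only indicated by a comment, since the checker
interface is internal to the executor and not spelled out here.

{code}
#include <string>

#include <mesos/executor.hpp>
#include <mesos/mesos.hpp>

// Hypothetical handler for a failed `docker stop` under this option: give up
// on the kill and tell the scheduler the task is still running, so that the
// framework can decide whether to retry the kill.
void onDockerStopFailure(
    mesos::ExecutorDriver* driver,
    const mesos::TaskID& taskId,
    const std::string& error)
{
  // Restart (health) checking here, so that a healthy container keeps being
  // reported as healthy.

  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskId);
  status.set_state(mesos::TASK_RUNNING);
  status.set_message("Failed to stop the docker container: " + error);

  driver->sendStatusUpdate(status);
}
{code}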

h3. Kill the agent
+ The container is not orphaned.
+ The agent does not try to operate until it (and its executors) can
communicate with docker properly.
\- Requires non-trivial changes to the code.
\- Might be a surprising and undesirable behaviour to operators, especially if 
there are non-docker-executor workloads on the agent.

h3. Only transition to {{TASK_KILLING}} on successful stop
Pause the (health) checks, run {{docker stop}}, and send a {{TASK_KILLING}}
update only if the command exited with status code 0; otherwise resume (health)
checking (see the sketch at the end of this section). Frameworks should then
assume that the kill request got lost or failed and retry.
+ Should require no changes to properly written frameworks.
+ The container is not orphaned.
+ A running healthy container will be treated as such, and no actions from
Marathon will be necessary.
\- Some frameworks might have to be updated if they don’t retry kills.
\- Since the {{docker stop}} command does not exit until the task finishes
(imagine a task with a long grace period), this will effectively defeat the
purpose of {{TASK_KILLING}}.
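
A sketch of the proposed ordering, again assuming the v0 driver API; pausing
and resuming the checks appears only as comments, and the kill policy's grace
period is not plumbed through to {{docker stop}}.

{code}
#include <cstdlib>
#include <string>

#include <sys/wait.h>

#include <mesos/executor.hpp>
#include <mesos/mesos.hpp>

// Hypothetical kill path for this option: only report TASK_KILLING once
// `docker stop` has actually succeeded; otherwise behave as if the kill
// request was never received so that the framework retries it.
void killTask(
    mesos::ExecutorDriver* driver,
    const mesos::TaskID& taskId,
    const std::string& container)
{
  // 1. Pause (health) checking (checker API omitted from this sketch).

  // 2. Run `docker stop`. NOTE: with a long grace period this call does not
  //    return until the task has already terminated, which is exactly the
  //    concern raised above.
  const std::string command = "docker stop " + container;
  int status = std::system(command.c_str());

  if (status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
    // 3a. The stop succeeded: only now report TASK_KILLING.
    mesos::TaskStatus update;
    update.mutable_task_id()->CopyFrom(taskId);
    update.set_state(mesos::TASK_KILLING);
    driver->sendStatusUpdate(update);
  } else {
    // 3b. The stop failed: resume (health) checking and let the framework
    //     assume the kill request was lost.
  }
}
{code}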

> Docker executor hangs forever if `docker stop` fails.
> -----------------------------------------------------
>
>                 Key: MESOS-6743
>                 URL: https://issues.apache.org/jira/browse/MESOS-6743
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> If {{docker stop}} finishes with an error status, the executor should catch 
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry 
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. 
> However, in this case it is unclear what status updates we should send: 
> {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill 
> a task? or set a specific reason in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is 
> killed or notify the framework and the operator that the container may still 
> be running.


