[ https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237126#comment-15237126 ]

Jan Schlicht commented on MESOS-5188:
-------------------------------------

The executor uses the return code of a finished task to determine whether it 
was successful. Running {{docker stop}} sends a {{SIGTERM}} to the sleep 
task, which therefore terminates with a return code != 0. The Mesos executor 
interprets this as failed, because only a return code == 0 is considered 
successful.
Tasks other than {{sleep}} might work, because they could catch the 
{{SIGTERM}} and exit gracefully. Whether this results in TASK_FAILED hence 
depends on the task you're running; that is something Mesos cannot cover.
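
For illustration, a minimal sketch (not from Mesos; {{onSigterm}} is a 
made-up name) of a task that catches {{SIGTERM}} and exits with code 0. Used 
in place of {{sleep 300}}, such a task would be reported as successful after 
a {{docker stop}}:
{code}
// Hypothetical stand-in for `sleep 300` that catches the SIGTERM sent
// by `docker stop` and exits with code 0, which the executor treats as
// success instead of TASK_FAILED.
#include <csignal>
#include <unistd.h>

// Async-signal-safe handler: exit immediately with status 0.
extern "C" void onSigterm(int)
{
  _exit(0);
}

int main()
{
  std::signal(SIGTERM, onSigterm);

  for (;;) {
    pause();  // block until a signal arrives
  }
}
{code}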

> docker executor thinks task is failed when docker container was stopped
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5188
>                 URL: https://issues.apache.org/jira/browse/MESOS-5188
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.28.0
>            Reporter: Liqiang Lin
>             Fix For: 0.29.0
>
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
> b4813ba3ed4d        ubuntu              "sleep 300"         9 seconds ago       Up 8 seconds                            mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Observe that the task has failed. See the Mesos slave log:
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 pid to scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task '99ee7dc74861' to executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
> I0407 09:12:21.308820 32308 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:12:21.310058 32308 slave.cpp:3593] Forwarding the update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to master@192.168.56.110:5050
> I0407 09:12:21.310744 32308 slave.cpp:3503] Sending acknowledgement for status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to executor(1)@192.168.56.110:40725
> I0407 09:12:21.315765 32307 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:16:13.943831 32306 slave.cpp:3195] Handling status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
> I0407 09:16:14.063880 32307 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:16:14.064471 32312 slave.cpp:3593] Forwarding the update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to master@192.168.56.110:5050
> I0407 09:16:14.065014 32312 slave.cpp:3503] Sending acknowledgement for status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to executor(1)@192.168.56.110:40725
> I0407 09:16:14.071398 32307 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:16:14.950019 32311 slave.cpp:3721] executor(1)@192.168.56.110:40725 exited
> I0407 09:16:14.993948 32308 docker.cpp:1997] Executor for container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' has exited
> I0407 09:16:14.994104 32308 docker.cpp:1717] Destroying container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:16:14.994875 32308 docker.cpp:1845] Running docker stop on container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:16:14.996547 32308 slave.cpp:4097] Executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 exited with status 0
> I0407 09:16:14.996695 32308 slave.cpp:4201] Cleaning up executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at executor(1)@192.168.56.110:40725
> I0407 09:16:14.997028 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for gc 6.99998846192days in the future
> I0407 09:16:14.997289 32308 slave.cpp:4289] Cleaning up framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:16:14.997691 32307 status_update_manager.cpp:282] Closing status update streams for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
> I0407 09:16:14.997697 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861' for gc 6.99998845958222days in the future
> I0407 09:16:14.998423 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000' for gc 6.99998845141333days in the future
> I0407 09:16:18.470573 32308 slave.cpp:4593] Current disk usage 56.56%. Max allowed age: 2.341048334320162days
> {code}
> Why does the docker executor think the task has failed when the docker 
> container is stopped? The docker container still exists. Should the executor 
> instead rely on whether the docker container has been removed?
> Docker executor code:
> {code}
>   void reaped(
>       ExecutorDriver* _driver,
>       const Future<Nothing>& run)
>   {
>     // Wait for docker->stop to finish, and best effort wait for the
>     // inspect future to complete with a timeout.
>     stop.onAny(defer(self(), [=](const Future<Nothing>&) {
>       inspect
>         .after(DOCKER_INSPECT_TIMEOUT, [=](const Future<Nothing>&) {
>           inspect.discard();
>           return inspect;
>         })
>         .onAny(defer(self(), [=](const Future<Nothing>&) {
>           CHECK_SOME(driver);
>           TaskState state;
>           string message;
>           if (!stop.isReady()) {
>             state = TASK_FAILED;
>             message = "Unable to stop docker container, error: " +
>                       (stop.isFailed() ? stop.failure() : "future discarded");
>           } else if (killed) {
>             state = TASK_KILLED;
>           } else if (!run.isReady()) {
>             state = TASK_FAILED;
>             message = "Docker container run error: " +
>                       (run.isFailed() ?
>                        run.failure() : "future discarded");
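>             // ... (remainder of reaped() elided in this excerpt)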
> {code}
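> For illustration, a standalone sketch (not Mesos code: the futures are 
> reduced to booleans and {{taskState}} is a made-up helper) of the branch 
> logic in {{reaped()}} above, showing why an external {{docker stop}} ends 
> up in the TASK_FAILED branch:
> {code}
> #include <iostream>
> #include <string>
>
> // Mirror of the if/else chain in reaped(), with each future replaced
> // by a boolean saying whether it completed successfully.
> std::string taskState(bool stopReady, bool killed, bool runReady)
> {
>   if (!stopReady) {
>     return "TASK_FAILED";    // could not stop the container
>   } else if (killed) {
>     return "TASK_KILLED";    // kill was requested through the executor
>   } else if (!runReady) {
>     return "TASK_FAILED";    // container run failed (non-zero exit)
>   }
>   return "TASK_FINISHED";
> }
>
> int main()
> {
>   // External `docker stop`: stopping succeeds, no kill went through the
>   // executor, and `sleep 300` dies from SIGTERM with a non-zero exit
>   // status, so the `run` future fails -> TASK_FAILED.
>   std::cout << taskState(true, false, false) << std::endl;
> }
> {code}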
> Another problem: Mesos thinks the task has failed and cleans up the 
> executor, but the orphaned docker container is not removed. I can still 
> start it manually, and Mesos cannot track the resource usage of this 
> orphaned container.
> {code}
> # docker -H 192.168.56.110:54375 start b4813ba3ed4d
> b4813ba3ed4d
> {code}


