[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231282#comment-15231282 ]

Anand Mazumdar commented on MESOS-3573:
---------------------------------------

{code}
commit 1fa6340f8c8723b8d23934898f6e1599b9ba13c1
Author: Anand Mazumdar <[email protected]>
Date:   Wed Apr 6 11:10:59 2016 -0700

    Added test for recovering orphaned docker containers.

    Review: https://reviews.apache.org/r/45455/

commit 54926ad18d3ef90ad452a8e216ff7b1cd465df0a
Author: Anand Mazumdar <[email protected]>
Date:   Tue Apr 5 09:35:45 2016 -0700

    Cleaned up orphaned docker containers owned by previous agent instance.

    This change modifies the docker containerizer to cleanup docker
    containers left from another agent instance. The containers can
    become orphans due to any of the scenarios mentioned here:
    http://bit.ly/1RxCpPl

    This change modifies the logic to invoke docker `ps` on all
    containers on the agent instead of limiting itself to the
    current slaveID. This change also means that running multiple
    agent instances on the same host might not work well for docker
    containers from now on i.e. another agent instance might
    cleanup the docker containers that belong to another instance.
    The cgroup isolators/linux launcher for the Mesos containerizer
    anyways don't recommend running multiple instances of the agent
    on the same host.

    In case one still wants to run multiple agent instances on a
    test cluster using the docker containerizer, we can use the
    `--no-docker_kill_orphans` flag and then kill the docker
    containers manually using a script.

    Review: https://reviews.apache.org/r/45454/

commit ca747406574b51b17cdcce8ced2ac5d4dfaa091a
Author: Anand Mazumdar <[email protected]>
Date:   Tue Apr 5 09:35:19 2016 -0700

    Fixed minor spacing cleanups in docker containerizer.

    Review: https://reviews.apache.org/r/45453/
{code}
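
For readers following the second commit, the change in scope (from "containers of the current slaveID" to "all Mesos containers on the host") can be illustrated with a small sketch. This is not the code from the review. It assumes that docker containers launched by the agent are named `mesos-<agentId>.<containerId>`, possibly with a further dot-separated suffix, and the helper names `parse_container_name` and `find_orphans` are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Assumption: agent-launched docker containers are named
# "mesos-<agentId>.<containerId>" (optionally followed by ".<suffix>").
NAME_PREFIX = "mesos-"
NAME_SEPARATOR = "."


def parse_container_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (agent_id, container_id), or None for non-Mesos containers."""
    name = name.lstrip("/")  # `docker inspect` reports names with a leading "/"
    if not name.startswith(NAME_PREFIX):
        return None
    fields = name[len(NAME_PREFIX):].split(NAME_SEPARATOR)
    if len(fields) < 2 or not fields[0] or not fields[1]:
        return None
    # Any extra suffix after the container ID is ignored in this sketch.
    return fields[0], fields[1]


def find_orphans(
    names: Iterable[str],
    known_container_ids: Set[str],
    agent_id: Optional[str] = None,
) -> List[str]:
    """List Mesos-named containers that the running agent does not know about.

    agent_id=None mirrors the new behaviour described in the commit message
    (consider every Mesos container on the host). Passing an agent ID mirrors
    the old behaviour (consider only that agent's containers).
    """
    orphans = []
    for name in names:
        parsed = parse_container_name(name)
        if parsed is None:
            continue  # not launched by a Mesos agent, leave it alone
        owner, container_id = parsed
        if agent_id is not None and owner != agent_id:
            continue  # old behaviour: other instances' containers are skipped
        if container_id not in known_container_ids:
            orphans.append(name)
    return orphans


if __name__ == "__main__":
    listed = ["/mesos-A-S0.c1", "/mesos-A-S1.c2", "/redis", "/mesos-A-S1.c3"]
    print(find_orphans(listed, {"c3"}))          # all agents on the host
    print(find_orphans(listed, {"c3"}, "A-S1"))  # current agent only
```

With `agent_id=None`, a container left behind by a previous agent instance (here `mesos-A-S0.c1`) is reported as an orphan, which is also why a second agent on the same host could end up cleaning up containers it does not own. For the manual route mentioned in the commit message, the names could come from something like `docker ps -a --format '{{.Names}}'`, with the resulting list reviewed before anything is killed.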

> Mesos does not kill orphaned docker containers
> ----------------------------------------------
>
>                 Key: MESOS-3573
>                 URL: https://issues.apache.org/jira/browse/MESOS-3573
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>            Reporter: Ian Babrou
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like 
> there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy app in docker container.
> 2. Kill corresponding mesos-docker-executor process
> 3. Observe hanging container
> Here are the logs after kill:
> {noformat}
> slave_1    | I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1    | I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1    | I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop 
> on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1    | I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 
> 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 
> 20150923-122130-2153451692-5050-1-0000 terminated with signal Terminated
> slave_1    | I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000 from @0.0.0.0:0
> slave_1    | I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1    | W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating 
> unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1    | I1002 12:12:59.368671  7791 status_update_manager.cpp:322] 
> Received status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000
> slave_1    | I1002 12:12:59.368741  7791 status_update_manager.cpp:826] 
> Checkpointing UPDATE for status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000
> slave_1    | I1002 12:12:59.370636  7791 status_update_manager.cpp:376] 
> Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) 
> for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000 to the slave
> slave_1    | I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000 to [email protected]:5050
> slave_1    | I1002 12:12:59.371908  7791 slave.cpp:2899] Status update 
> manager successfully handled status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000
> master_1   | I1002 12:12:59.372047    11 master.cpp:4069] Status update 
> TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000 from slave 
> 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 
> (172.16.91.128)
> master_1   | I1002 12:12:59.372534    11 master.cpp:4108] Forwarding status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000
> master_1   | I1002 12:12:59.373018    11 master.cpp:5576] Updating the latest 
> state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-0000 to TASK_FAILED
> master_1   | I1002 12:12:59.373447    11 hierarchical.hpp:814] Recovered 
> cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; 
> mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: 
> cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from 
> framework 20150923-122130-2153451692-5050-1-0000
> {noformat}
> Another issue: if you restart mesos-slave on the host with orphaned docker 
> containers, they are not getting killed. This was the case before and I hoped 
> for this trick to kill hanging containers, but it doesn't work now.
> Marking this as critical because it hoards cluster resources and blocks 
> scheduling.



