[
https://issues.apache.org/jira/browse/MESOS-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981738#comment-14981738
]
Gilbert Song commented on MESOS-3808:
-------------------------------------
Hi Chris, thank you for narrowing this blocker down.
I arrived at the same place you found, which appears to be what causes the
blockage. When the Mesos slave restarts, the containerizer tries to recover the
containers for each executor. It calls reaped(), which triggers the container
destroy. I just submitted a patch on Review Board that removes the if/else and
runs docker->stop().onAny(defer(...__destroy...)) directly. I will continue to
work on this blocker over the next couple of days.
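To make the intent concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual docker.cpp code: the real containerizer uses libprocess Futures and defer(), which this sketch approximates with a plain callback, and names like DockerStub and destroyContainer are hypothetical stand-ins for docker->stop() and __destroy(). The point is only that the destroy step is always chained onto the completion of docker stop, with no if/else deciding whether cleanup runs.
{code}
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for the Docker abstraction in docker.cpp. stop()
// pretends to be an asynchronous `docker stop` and invokes `onAny` when the
// attempt finishes, whether or not it succeeded (in Mesos this would be a
// libprocess Future with .onAny(defer(...))).
struct DockerStub {
  void stop(const std::string& containerId,
            const std::function<void(bool)>& onAny) {
    std::cout << "Running docker stop on container '" << containerId << "'\n";
    onAny(true);
  }
};

// Stand-in for the final __destroy step: remove bookkeeping for the
// container once the stop attempt has completed.
void destroyContainer(const std::string& containerId, bool stopped) {
  std::cout << "Destroying container '" << containerId << "'"
            << (stopped ? "" : " (stop failed, destroying anyway)") << "\n";
}

int main() {
  DockerStub docker;
  const std::string containerId = "a2308dfc-ec2f-4687-ae92-f045dd2d3614";

  // The gist of the proposed patch: unconditionally run `docker stop` and
  // defer the destroy to its completion, so containers whose executors
  // already exited during recovery still get cleaned up.
  docker.stop(containerId, [&](bool stopped) {
    destroyContainer(containerId, stopped);
  });

  return 0;
}
{code}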
> slave/containerizer/docker leaves orphan containers on restart of mesos-slave
> -----------------------------------------------------------------------------
>
> Key: MESOS-3808
> URL: https://issues.apache.org/jira/browse/MESOS-3808
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker, slave
> Affects Versions: 0.25.0
> Environment: CoreOS. Running mesos-slave in a container.
> Reporter: Chris Fortier
> Assignee: Gilbert Song
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker
> containers launched by Mesos were being orphaned and not destroyed when the
> Mesos agent was restarted.
> Relevant log output:
> {noformat}
> I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
> I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to
> executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
> I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to
> executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
> I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
> E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13:
> Transport endpoint is not connected [107]
> I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
> I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13:
> Transport endpoint is not connected [107]
> I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> I1027 20:36:22.518592 23005 slave.cpp:3433] Executor
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> I1027 20:36:22.519263 23005 slave.cpp:3433] Executor
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown
> container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
> W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown
> container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
> I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status
> update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status
> update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because
> the slave is in RECOVERING state
> W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because
> the slave is in RECOVERING state
> I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered
> executors
> I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
> {noformat}
> Docker output:
> {noformat}
> CONTAINER ID   IMAGE             COMMAND                 CREATED              STATUS              PORTS   NAMES
> 8d0d69fe34d7   libmesos/ubuntu   "/bin/sh -c 'while s    About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
> e4344cfbcc6d   libmesos/ubuntu   "/bin/sh -c 'while s    About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
> 3ce690f3b872   libmesos/ubuntu   "/bin/sh -c 'while s    4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
> 5b4546d3087a   libmesos/ubuntu   "/bin/sh -c 'while s    4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
> {noformat}
> After digging into the issue, it seems the comment linked below might be the
> problem:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97
> It appears that the recovery command is still only sending the containerId
> and not the frameworkId + containerId.
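> A minimal standalone sketch of this kind of mismatch (hypothetical helper
> names, not the actual docker.cpp code): if the name built when a container is
> launched carries more identifiers than the name used for the recovery-side
> lookup, the lookup never matches the running container and it is left
> orphaned. The slaveId-based form below simply mirrors the names visible in the
> `docker ps` output above.
> {code}
> #include <iostream>
> #include <string>
>
> // Name prefix used for Mesos-launched Docker containers, as seen in the
> // `docker ps` output above.
> const std::string DOCKER_NAME_PREFIX = "mesos-";
>
> // Hypothetical helper: the name given to a container at launch time,
> // matching the "mesos-<slaveId>.<containerId>" names listed above.
> std::string launchName(const std::string& slaveId,
>                        const std::string& containerId) {
>   return DOCKER_NAME_PREFIX + slaveId + "." + containerId;
> }
>
> // Hypothetical recovery-side lookup keyed on the containerId alone, which is
> // the kind of mismatch this report suspects.
> std::string recoveryName(const std::string& containerId) {
>   return DOCKER_NAME_PREFIX + containerId;
> }
>
> int main() {
>   const std::string slaveId = "bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14";
>   const std::string containerId = "a2308dfc-ec2f-4687-ae92-f045dd2d3614";
>
>   // The two names differ, so a recovery lookup using recoveryName() would
>   // never match the running container created under launchName().
>   std::cout << "launched as:  " << launchName(slaveId, containerId) << "\n"
>             << "looked up as: " << recoveryName(containerId) << "\n";
>   return 0;
> }
> {code}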