[
https://issues.apache.org/jira/browse/MESOS-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Hall updated MESOS-1915:
-------------------------------
Description:
When we launch Docker containers on our Mesos cluster using Marathon, we have
noticed that we end up with several Docker containers running, with only one of
them actually being tracked by Mesos. When inspected, the containers all have
the same start time.

This seems to be because Mesos gives up on trying to start the container after
one minute, but fails to clean up the Docker container because it is not yet
running. Eventually the container starts alongside all of the other attempts
Mesos has made, and we end up with several containers running but only one
being tracked by Mesos.

I've pasted some logs from the slave below, filtered for that particular task,
but it is pretty easy to replicate in our environment, so I'm happy to provide
further logs, details and analysis as required. This is becoming a big problem
for us, so we are happy to help as much as possible.
{noformat}
Oct 13 04:47:42 mesosslave-1 mesos-slave[16647]: I1013 04:47:42.776945 16661
docker.cpp:743] Starting container 'dd113461-4d18-4170-8e3f-9527e6d7f598' for
task 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' (and executor
'docker-test.11588a48-5294-11e4-adea-42010af0f51e') of framework
'20140918-022627-519434250-5050-6171-0000'
Oct 13 04:48:42 mesosslave-1 mesos-slave[16647]: E1013 04:48:42.819563 16664
slave.cpp:2205] Failed to update resources for container
dd113461-4d18-4170-8e3f-9527e6d7f598 of executor
docker-test.11588a48-5294-11e4-adea-42010af0f51e running task
docker-test.11588a48-5294-11e4-adea-42010af0f51e on status update for terminal
task, destroying container: No container found
Oct 13 04:49:29 mesosslave-1 mesos-slave[16647]: I1013 04:49:29.916460 16665
slave.cpp:2538] Monitoring executor
'docker-test.11588a48-5294-11e4-adea-42010af0f51e' of framework
'20140918-022627-519434250-5050-6171-0000' in container
'dd113461-4d18-4170-8e3f-9527e6d7f598'
Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.103175 16663
docker.cpp:1286] Updated 'cpu.shares' to 102 at
/cgroup/cpu/docker/6a581f5c2174dc76bcfb2e5b89fd9a4310732c384d93901a8b37da8aeb700468
for container dd113461-4d18-4170-8e3f-9527e6d7f598
Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.105036 16663
docker.cpp:1321] Updated 'memory.soft_limit_in_bytes' to 32MB for container
dd113461-4d18-4170-8e3f-9527e6d7f598
{noformat}
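
As a rough way to spot the orphaned containers, something like the Python
sketch below can be run on a slave. It is only an illustration, not a tested
tool: it assumes the slave's /state.json endpoint is reachable on port 5051,
that each executor entry there carries a 'container' field, and that the
Docker containerizer names its containers with a 'mesos-' prefix.
{noformat}
#!/usr/bin/env python
# Rough sketch only: list Docker containers on this slave that Mesos is
# not tracking. Assumes Python 2, the slave HTTP endpoint on port 5051,
# a 'container' field on each executor in /state.json, and the 'mesos-'
# name prefix used by the Docker containerizer.
import json
import subprocess
import urllib2

# Container IDs the slave believes it is currently managing.
state = json.load(urllib2.urlopen("http://localhost:5051/state.json"))
tracked = set()
for framework in state.get("frameworks", []):
    for executor in framework.get("executors", []):
        tracked.add(executor.get("container", ""))

# Names of the containers Docker is actually running.
output = subprocess.check_output(["docker", "ps", "--no-trunc"])
for line in output.splitlines()[1:]:
    name = line.split()[-1]
    if name.startswith("mesos-") and name[len("mesos-"):] not in tracked:
        print("untracked container: %s" % name)
{noformat}
That should be enough to show containers Docker is running that the slave no
longer knows about.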
> Docker containers that fail to launch are not killed
> ----------------------------------------------------
>
> Key: MESOS-1915
> URL: https://issues.apache.org/jira/browse/MESOS-1915
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.20.1
> Environment: Mesos 0.20.1 using the docker executor with a private
> docker repository. Images often take up to 5 minutes to launch.
> /etc/mesos-slave/executor_registration_timeout is set to '10mins'
> Reporter: Daniel Hall
>
> When we launch Docker containers on our Mesos cluster using Marathon, we have
> noticed that we end up with several Docker containers running, with only one
> of them actually being tracked by Mesos. When inspected, the containers all
> have the same start time.
> This seems to be because Mesos gives up on trying to start the container
> after one minute, but fails to clean up the Docker container because it is
> not yet running. Eventually the container starts alongside all of the other
> attempts Mesos has made, and we end up with several containers running but
> only one being tracked by Mesos.
> I've pasted some logs from the slave below, filtered for that particular
> task, but it is pretty easy to replicate in our environment, so I'm happy to
> provide further logs, details and analysis as required. This is becoming a
> big problem for us, so we are happy to help as much as possible.
> {noformat}
> Oct 13 04:47:42 mesosslave-1 mesos-slave[16647]: I1013 04:47:42.776945 16661
> docker.cpp:743] Starting container 'dd113461-4d18-4170-8e3f-9527e6d7f598' for
> task 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' (and executor
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e') of framework
> '20140918-022627-519434250-5050-6171-0000'
> Oct 13 04:48:42 mesosslave-1 mesos-slave[16647]: E1013 04:48:42.819563 16664
> slave.cpp:2205] Failed to update resources for container
> dd113461-4d18-4170-8e3f-9527e6d7f598 of executor
> docker-test.11588a48-5294-11e4-adea-42010af0f51e running task
> docker-test.11588a48-5294-11e4-adea-42010af0f51e on status update for
> terminal task, destroying container: No container found
> Oct 13 04:49:29 mesosslave-1 mesos-slave[16647]: I1013 04:49:29.916460 16665
> slave.cpp:2538] Monitoring executor
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' of framework
> '20140918-022627-519434250-5050-6171-0000' in container
> 'dd113461-4d18-4170-8e3f-9527e6d7f598'
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.103175 16663
> docker.cpp:1286] Updated 'cpu.shares' to 102 at
> /cgroup/cpu/docker/6a581f5c2174dc76bcfb2e5b89fd9a4310732c384d93901a8b37da8aeb700468
> for container dd113461-4d18-4170-8e3f-9527e6d7f598
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.105036 16663
> docker.cpp:1321] Updated 'memory.soft_limit_in_bytes' to 32MB for container
> dd113461-4d18-4170-8e3f-9527e6d7f598
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)