[ https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115249#comment-15115249 ]

Travis Hegner commented on MESOS-3706:
--------------------------------------

This may be related to an issue we've been experiencing with Mesos, Marathon, 
and Docker. In our case, there is a combination of issues involving both the 
mesos-docker-executor and the Docker daemon itself.

We are running Mesos 0.27 (master branch), Marathon 0.14, and Docker 1.9.1.

Also, in our case, the container would actually start and run perfectly 
normally, even though the Mesos interface reported the task as stuck in 
"STAGING". Typically, the task would eventually hit a timeout and be retried 
elsewhere, where it might or might not get stuck in staging again.

I'm very curious whether this is the same root cause as the issue described 
here. You can check by watching your Docker daemon log when your task tries to 
launch. When the `docker run` command is issued, the client actually performs 
a `create`, `attach`, and `start` against the Docker API. The `docker inspect` 
command issues a single GET against the container name to get its JSON 
configuration. The threaded nature of the `mesos-docker-executor` causes the 
`run` and `inspect` commands to be issued simultaneously. Whenever we 
experienced the issue, the Docker log showed the internal calls arriving in 
the order `create`, `inspect`, `attach`, then `start`. I believe, though I 
have not verified it for certain, that the `inspect` call hangs indefinitely 
because the container is not yet completely started, and as a result the 
`mesos-docker-executor` never receives the `running` state. The `run` command 
would still complete normally, and the container would start without issue, 
even though the task was never reported as running.
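
For illustration, here is a minimal standalone sketch (Python, using only the 
standard subprocess and threading modules; it is not the executor's actual C++ 
code) that mimics that interleaving by issuing `docker run` and `docker 
inspect` from two threads at roughly the same time. The image and container 
names are just placeholders, and depending on timing and daemon version the 
inspect may error out, return early, or, in the case described above, hang.

# Standalone sketch (not Mesos code): issue `docker run` and `docker inspect`
# concurrently, the way the threaded executor does. Names are placeholders.
import subprocess
import threading

NAME = "inspect-race-demo"  # hypothetical container name for the demo

def run_container():
    # `docker run` internally does create, attach, and start against the API.
    subprocess.run(
        ["docker", "run", "--rm", "--name", NAME, "alpine", "sleep", "5"],
        check=False,
    )

def inspect_container():
    # Fired at the same time as run, like the executor's inspect call.
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", NAME],
        capture_output=True,
        text=True,
    )
    print("inspect exit code:", result.returncode, "output:", result.stdout.strip())

threads = [threading.Thread(target=run_container),
           threading.Thread(target=inspect_container)]
for t in threads:
    t.start()
for t in threads:
    t.join()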

I was able to create a rudimentary fix for our issue by injecting a small 
delay (only 500ms) between the `docker run` command and the `docker inspect` 
command. This allowed the container to start fully before the inspect was 
attempted, thereby avoiding whatever in the Docker daemon was causing the 
inspect command to hang indefinitely.
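
To make the workaround concrete, here is the same idea as a rough standalone 
sketch (again Python for illustration only; the real change lives in the C++ 
mesos-docker-executor on the branch linked below): start the `run`, wait 
roughly 500ms, then issue the `inspect`.

# Sketch of the delay workaround (illustration only): give the daemon a short
# grace period after `run` before issuing `inspect`. 0.5s mirrors the 500ms
# delay mentioned above; the container name is a placeholder.
import subprocess
import threading
import time

NAME = "inspect-delay-demo"  # hypothetical container name for the demo

run_thread = threading.Thread(
    target=subprocess.run,
    args=(["docker", "run", "--rm", "--name", NAME, "alpine", "sleep", "5"],),
    kwargs={"check": False},
)
run_thread.start()

time.sleep(0.5)  # let create/attach/start complete before inspecting

state = subprocess.run(
    ["docker", "inspect", "--format", "{{.State.Running}}", NAME],
    capture_output=True,
    text=True,
)
print("container running:", state.stdout.strip())
run_thread.join()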

I'd like to know whether this is the exact cause of this issue, in which case 
I could submit a pull request against it; if it's a separate issue, I could 
file a new one and submit the pull request there. If you'd like to try out our 
fix, there is a branch here: https://github.com/travishegner/mesos/tree/inspectDelay. 
Be aware that this branch also contains our fix for #4370.

> Tasks stuck in staging.
> -----------------------
>
>                 Key: MESOS-3706
>                 URL: https://issues.apache.org/jira/browse/MESOS-3706
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>    Affects Versions: 0.23.0, 0.24.1
>            Reporter: Jord Sonneveld
>         Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot 
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO, 
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a docker image which starts fine on all my slaves except for one.  On 
> that one, it is stuck in STAGING for a long time and never starts.  The INFO 
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task 
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: 
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in staging.  It is launched by 
> marathon.  I have launched 161 instances successfully on my cluster.  But it 
> refuses to launch on this specific slave.
> These machines are all managed via ansible so their configurations are / 
> should be identical.  I have re-run my ansible scripts and rebooted the 
> machines to no avail.
> It's been in this state for almost 30 minutes.  You can see the mesos docker 
> executor is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root     35360  0.0  0.0 1070576 21476 ?       Ssl  15:46   0:00 
> mesos-docker-executor 
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox 
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER ID        IMAGE                                              
> COMMAND                  CREATED             STATUS              PORTS        
>                                     NAMES
> 5c858b90b0a0        registry.roger.dal.moz.com:5000/moz-statsd-v0.22   
> "/bin/sh -c ./start.s"   39 minutes ago      Up 39 minutes       
> 0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp   statsd-fe-influxdb
> d765ba3829fd        registry.roger.dal.moz.com:5000/moz-statsd-v0.22   
> "/bin/sh -c ./start.s"   41 minutes ago      Up 41 minutes       
> 0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp   statsd-repeater
> Those are the only two entries. Nothing about the kwe-vinland job.



