[ https://issues.apache.org/jira/browse/MESOS-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Travis Hegner reassigned MESOS-4581:
------------------------------------

    Assignee: Travis Hegner

> mesos-docker-executor has a race condition causing docker tasks to be stuck 
> in staging when trying to launch
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-4581
>                 URL: https://issues.apache.org/jira/browse/MESOS-4581
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.26.0, 0.27.0
>         Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
>            Reporter: Travis Hegner
>            Assignee: Travis Hegner
>
> We are still working to understand the root cause of this issue, but here is 
> what we know so far:
> Symptoms:
> Launching docker containers from marathon in mesos results in the marathon 
> app being stuck in a "staged" status, and the mesos task being stuck in a 
> "staging" status until a timeout, at which point it will launch on another 
> host, with approximately a 50/50 chance of working or being stuck staging 
> again.
> We have a lot of custom containers, custom networking configs, and custom 
> docker run parameters, but we can't seem to narrow this down to any one 
> particular aspect of our environment. This happens randomly per marathon app 
> while it's attempting to start or restart an instance, whether the app's 
> config has changed or not. I can't seem to find anyone else having a similar 
> issue, which leads me to believe that it is some combination of aspects of 
> our environment that triggers this race condition.
> Deeper analysis:
> The mesos-docker-executor fires the "docker run ..." command in a future. 
> Simultaneously (for all intents and purposes), it fires a "docker inspect" 
> against the container it is trying to start at that moment. When we see 
> this bug, the container starts normally, but the docker inspect command 
> hangs forever; it never retries and never times out.
> When the task launches successfully, the docker inspect command fails once 
> with a non-zero exit code, retries 500ms later, succeeds, and flags the task 
> as "RUNNING" in both mesos and marathon simultaneously.
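> To make this pattern concrete, here is a minimal standalone sketch of the 
> behavior described above. It uses plain C++ futures shelling out to the 
> docker CLI rather than the libprocess futures and Docker abstraction the 
> real executor uses, and the container name is hypothetical:
> {code}
> // Launch "docker run" and "docker inspect" effectively at the same time,
> // mirroring the executor behavior described above.
> #include <cstdlib>
> #include <future>
> #include <iostream>
> #include <string>
>
> int main()
> {
>   const std::string name = "example-task"; // hypothetical container name
>
>   // Fire "docker run ..." asynchronously, analogous to the executor
>   // launching the container in a future.
>   std::future<int> run = std::async(std::launch::async, [&name]() {
>     return std::system(
>         ("docker run --name " + name + " alpine sleep 60").c_str());
>   });
>
>   // For all intents and purposes simultaneously, inspect the container
>   // that is still being created; per the analysis above, this is the call
>   // that can hang forever when the race is hit.
>   std::future<int> inspect = std::async(std::launch::async, [&name]() {
>     return std::system(("docker inspect " + name).c_str());
>   });
>
>   std::cout << "inspect exit status: " << inspect.get() << std::endl;
>   std::cout << "run exit status: " << run.get() << std::endl;
>   return 0;
> }
> {code}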
> If you watch the docker log, you'll notice that a "docker run" via the 
> command line actually triggers three docker API calls in succession: 
> "create", "attach", and "start", in that order. It has been fairly 
> consistent that when we see this bug triggered, the docker log shows the 
> "create" from the run command, then a GET for the inspect command, and only 
> later the "attach" and "start". When it works successfully, we see the GET 
> first (failing, of course, because the container doesn't exist yet), and 
> then the "create", "attach", and "start".
> Rudimentary Solution:
> We have written a very basic patch which delays the initial inspect call on 
> the container (via ".after()") until at least one DOCKER_INSPECT_DELAY 
> (500ms) after the docker run command has been issued. This has eliminated 
> the bug as far as we can tell.
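> The idea, sketched in plain C++ below (this is not the actual 
> libprocess-based code in the branch; the helper name and retry loop are 
> illustrative assumptions), is simply to wait one delay interval before the 
> first inspect and then keep retrying every 500ms until it succeeds:
> {code}
> #include <chrono>
> #include <cstdlib>
> #include <string>
> #include <thread>
>
> // Hypothetical helper: poll "docker inspect <name>" until it succeeds,
> // but only start polling after an initial DOCKER_INSPECT_DELAY.
> bool inspectUntilVisible(const std::string& name, int maxAttempts = 60)
> {
>   const auto DOCKER_INSPECT_DELAY = std::chrono::milliseconds(500);
>
>   // The key change: don't inspect immediately after "docker run"; give
>   // docker at least one delay interval to get past "create".
>   std::this_thread::sleep_for(DOCKER_INSPECT_DELAY);
>
>   for (int i = 0; i < maxAttempts; ++i) {
>     if (std::system(("docker inspect " + name).c_str()) == 0) {
>       return true; // container is visible; the executor would report RUNNING
>     }
>     std::this_thread::sleep_for(DOCKER_INSPECT_DELAY);
>   }
>   return false;
> }
>
> int main() { return inspectUntilVisible("example-task") ? 0 : 1; }
> {code}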
> I am not sure if this one-time initial delay is the most appropriate fix, or 
> if it would be better to add a timeout to the inspect call in the 
> mesos-docker-executor, so that a hung inspect is discarded and a new one is 
> started. The timeout/retry may be appropriate whether or not the initial 
> delay exists.
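> For illustration, a timeout/retry around the inspect might look roughly like 
> the sketch below. Plain C++ threads stand in for the executor's libprocess 
> machinery, and the timeout value and helper names are assumptions, not the 
> actual mesos code:
> {code}
> #include <chrono>
> #include <cstdlib>
> #include <future>
> #include <memory>
> #include <string>
> #include <thread>
>
> // Run one "docker inspect <name>" with a timeout. Returns true only if the
> // command completed successfully within the timeout.
> bool inspectWithTimeout(const std::string& name,
>                         std::chrono::milliseconds timeout)
> {
>   auto done = std::make_shared<std::promise<int>>();
>   std::future<int> result = done->get_future();
>
>   // Detach the attempt so a hung inspect can't block the retry loop; this
>   // mirrors "discard the current inspect and start a new one".
>   std::thread([name, done]() {
>     done->set_value(std::system(("docker inspect " + name).c_str()));
>   }).detach();
>
>   if (result.wait_for(timeout) != std::future_status::ready) {
>     return false; // hung: abandon this attempt and let the caller retry
>   }
>   return result.get() == 0;
> }
>
> int main()
> {
>   const std::string name = "example-task";      // hypothetical
>   const auto timeout = std::chrono::seconds(5);  // assumed timeout value
>
>   // Each hung or failed attempt is simply replaced by a new one after a
>   // short pause, whether or not an initial delay was applied first.
>   while (!inspectWithTimeout(name, timeout)) {
>     std::this_thread::sleep_for(std::chrono::milliseconds(500));
>   }
>   return 0;
> }
> {code}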
> In Summary:
> It appears that mesos-docker-executor does not have a race condition itself, 
> but rather seems to be triggering one in docker. Since we haven't found this 
> issue reported anywhere else with any substance, we understand that it is 
> likely related to our environment. Our custom network driver for docker does 
> some cluster-wide coordination, and may introduce just enough delay between 
> the "create" and "attach" calls to cause us to witness this bug on roughly 
> 50-60% of attempted container starts.
> The inspectDelay patch that I've written for this issue is located in my 
> inspectDelay branch at:
> https://github.com/travishegner/mesos/tree/inspectDelay
> I am happy to supply this patch as a pull request, or to put it through the 
> review board, if the maintainers feel this is an appropriate fix, or at 
> least a reasonable stop-gap measure until a better fix can be written.


