[ https://issues.apache.org/jira/browse/MESOS-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Travis Hegner reassigned MESOS-4581:
------------------------------------
Assignee: Travis Hegner
> mesos-docker-executor has a race condition causing docker tasks to be stuck
> in staging when trying to launch
> ------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-4581
> URL: https://issues.apache.org/jira/browse/MESOS-4581
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.26.0, 0.27.0
> Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
> Reporter: Travis Hegner
> Assignee: Travis Hegner
>
> We are still working to understand the root cause of this issue, but here is
> what we know so far:
> Symptoms:
> Launching docker containers from marathon in mesos results in the marathon
> app being stuck in a "staged" status, and the mesos task being stuck in a
> "staging" status until a timeout, at which point it will launch on another
> host, with approximately a 50/50 chance of working or being stuck staging
> again.
> We have a lot of custom containers, custom networking configs, and custom
> docker run parameters, but we can't seem to narrow this down to any one
> particular aspect of our environment. This happens randomly per marathon app
> while it's attempting to start or restart an instance, whether the app's
> config has changed or not. I can't seem to find anyone else having a similar
> issue, which leads me to believe that it takes a combination of aspects of
> our environment to trigger this race condition.
> Deeper analysis:
> The mesos-docker-executor fires the "docker run ..." command in a future. It
> simultaneously (for all intents and purposes) fires a "docker inspect"
> against the container which it is trying to start at that moment. When we see
> this bug, the container starts normally, but the docker inspect command hangs
> forever. It never retries, and never times out.
> When the task launches successfully, the docker inspect command fails once
> with an exit code, retries 500ms later, and succeeds, flagging the task as
> "RUNNING" in both mesos and marathon simultaneously.
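> To make that ordering concrete, here is a small standalone sketch of the
> launch pattern. This is not the executor code itself (which uses libprocess
> futures rather than std::async); it just shells out to the docker CLI with a
> made-up container name and image, and assumes docker is on the PATH:
> {code}
> #include <cstdio>
> #include <cstdlib>
> #include <future>
> #include <string>
>
> // Helper for this sketch only: run a shell command, return its raw status.
> static int runCommand(const std::string& cmd)
> {
>   return std::system(cmd.c_str());
> }
>
> int main()
> {
>   const std::string name = "inspect-race-demo";
>
>   // Issue "docker run" asynchronously, much like the executor's future.
>   std::future<int> run = std::async(std::launch::async, [&]() {
>     return runCommand("docker run --rm --name " + name + " alpine sleep 5");
>   });
>
>   // Issue the inspect essentially simultaneously. Depending on where it
>   // lands relative to docker's internal create/attach/start sequence, it
>   // either fails fast (and can be retried) or, as described above, hangs.
>   std::future<int> inspect = std::async(std::launch::async, [&]() {
>     return runCommand("docker inspect " + name);
>   });
>
>   std::printf("inspect status: %d\n", inspect.get());
>   std::printf("run status: %d\n", run.get());
>   return 0;
> }
> {code}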
> If you watch the docker log, you'll notice that a "docker run" via the
> command line actually triggers 3 docker API calls in succession: "create",
> "attach", and "start", in that order. It's been fairly consistent that when
> we see this bug triggered, the docker log has the "create" from the run
> command, then a GET for the inspect command, then the "attach" and "start"
> later. When we see this work successfully, we see the GET first (failing, of
> course, because the container doesn't exist yet), and then the "create",
> "attach", and "start".
> Rudimentary Solution:
> We have written a very basic patch which uses ".after()" to delay that
> initial inspect call on the container by at least one DOCKER_INSPECT_DELAY
> (500ms) after the docker run command is issued. This has eliminated the bug
> as far as we can tell.
> I am not sure if this one-time initial delay is the most appropriate fix, or
> if it would be better to add a timeout to the inspect call in the
> mesos-docker-executor, which would destroy the current inspect thread and
> start a new one. The timeout/retry may be appropriate whether the initial
> delay exists or not.
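> To show what we mean (again, a standalone sketch under the same assumptions
> as the one above, not the actual patch in the branch referenced below), the
> mitigation boils down to holding back the initial inspect by one delay
> interval and then retrying a bounded number of times:
> {code}
> #include <chrono>
> #include <cstdlib>
> #include <future>
> #include <string>
> #include <thread>
>
> // Stand-in for the 500ms DOCKER_INSPECT_DELAY mentioned above.
> static const std::chrono::milliseconds kInspectDelay(500);
>
> static int runCommand(const std::string& cmd)
> {
>   return std::system(cmd.c_str());
> }
>
> int main()
> {
>   const std::string name = "inspect-delay-demo";
>
>   std::future<int> run = std::async(std::launch::async, [&]() {
>     return runCommand("docker run --rm --name " + name + " alpine sleep 5");
>   });
>
>   // Hold back the *initial* inspect by one delay interval so it cannot
>   // land between docker's internal "create" and "attach"/"start" steps.
>   std::this_thread::sleep_for(kInspectDelay);
>
>   // Then retry with a bound, rather than trusting a single attempt.
>   int status = -1;
>   for (int attempt = 0; attempt < 10 && status != 0; ++attempt) {
>     status = runCommand("docker inspect " + name);
>     if (status != 0) {
>       std::this_thread::sleep_for(kInspectDelay);
>     }
>   }
>
>   run.get();
>   return status == 0 ? 0 : 1;
> }
> {code}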
> In Summary:
> It appears that mesos-docker-executor does not have a race condition itself,
> but it seems to be triggering one in docker. Since we haven't found this
> issue anywhere else with any substance, we understand that it is likely
> related to our environment. Our custom network driver for docker does some
> cluster-wide coordination, and may introduce just enough delay between the
> "create" and "attach" calls to cause us to witness this bug on roughly
> 50-60% of attempted container starts.
> The inspectDelay patch that I've written for this issue is located in my
> inspectDelay branch at:
> https://github.com/travishegner/mesos/tree/inspectDelay
> I am happy to supply this patch as a pull request, or to put it through the
> review board, if the maintainers feel this is an appropriate fix, or at least
> an acceptable stop-gap measure until a better fix can be written.