Travis Hegner created MESOS-4581:
------------------------------------
Summary: mesos-docker-executor has a race condition causing docker
tasks to be stuck in staging when trying to launch
Key: MESOS-4581
URL: https://issues.apache.org/jira/browse/MESOS-4581
Project: Mesos
Issue Type: Bug
Components: containerization, docker
Affects Versions: 0.26.0, 0.27.0
Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
Reporter: Travis Hegner
We are still working to understand the root cause of this issue, but here is
what we know so far:
Symptoms:
Launching docker containers from marathon in mesos results in the marathon app
being stuck in a "staged" status and the mesos task being stuck in a "staging"
status until a timeout, at which point the task is relaunched on another host,
again with approximately a 50/50 chance of either working or getting stuck in
staging.
We have a lot of custom containers, custom networking configs, and custom
docker run parameters, but we can't seem to narrow this down to any one
particular aspect of our environment. This happens randomly per marathon app
while it's attempting to start or restart an instance, whether the app's config
has changed or not. I can't seem to find anyone else having a similar issue,
which leads me to believe that it takes a combination of aspects within our
environment to trigger this race condition.
Deeper analysis:
The mesos-docker-executor fires the "docker run ..." command in a future. It
simultaneously (for all intents and purposes) fires a "docker inspect" against
the container it is trying to start at that moment. When we see this bug, the
container starts normally, but the docker inspect command hangs forever. It
never retries and never times out.
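For reference, this timing can be modeled outside of mesos with two concurrent
CLI calls. The snippet below is only a simplified, self-contained model of the
behavior described above, not the executor code itself; the container name and
image are made up, and the real executor retries the inspect every 500ms rather
than issuing it once.

{code}
#include <cstdlib>
#include <future>
#include <string>

int main()
{
  // Illustrative container name; not from our environment.
  const std::string name = "inspect-race-test";

  // Fires "docker run ..." asynchronously, like the executor's run future.
  // Under the hood this single CLI call issues "create", "attach", and
  // "start" against the docker API.
  std::future<int> run = std::async(std::launch::async, [&name]() {
    return std::system(
        ("docker run --name " + name + " ubuntu:14.04 sleep 30").c_str());
  });

  // Fired immediately afterwards, effectively concurrently with the calls
  // above. In the failure mode described here, this lands between docker's
  // "create" and "attach" and never returns.
  std::future<int> inspect = std::async(std::launch::async, [&name]() {
    return std::system(("docker inspect " + name).c_str());
  });

  const int inspectStatus = inspect.get();
  const int runStatus = run.get();
  return (inspectStatus == 0 && runStatus == 0) ? 0 : 1;
}
{code}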
When the task launches successfully, the docker inspect command fails once
with an exit code, retries 500ms later, and succeeds, flagging the task as
"RUNNING" in both mesos and marathon simultaneously.
If you watch the docker log, you'll notice that a "docker run" via the command
line actually triggers three docker API calls in succession: "create",
"attach", and "start", in that order. It has been fairly consistent that when
we see this bug triggered, the docker log shows the "create" from the run
command, then a GET for the inspect command, and only later the "attach" and
"start". When the launch works successfully, we see the GET first (failing, of
course, because the container doesn't exist yet), and then the "create",
"attach", and "start".
Rudimentary Solution:
We have written a very basic patch which uses ".after()" to delay the initial
inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms)
after the docker run command is fired. This has eliminated the bug as far as we
can tell.
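The general shape of that change, in the same self-contained model used above
(this is only an illustration of the idea, not the actual code in the
inspectDelay branch linked below):

{code}
#include <chrono>
#include <cstdlib>
#include <future>
#include <string>
#include <thread>

int main()
{
  // Illustrative container name; DOCKER_INSPECT_DELAY mirrors the 500ms
  // constant mentioned above.
  const std::string name = "inspect-race-test";
  const auto DOCKER_INSPECT_DELAY = std::chrono::milliseconds(500);

  std::future<int> run = std::async(std::launch::async, [&name]() {
    return std::system(
        ("docker run --name " + name + " ubuntu:14.04 sleep 30").c_str());
  });

  // The one change: give "docker run" a head start of one
  // DOCKER_INSPECT_DELAY so the first inspect can no longer land between
  // docker's "create" and "attach".
  std::this_thread::sleep_for(DOCKER_INSPECT_DELAY);

  std::future<int> inspect = std::async(std::launch::async, [&name]() {
    return std::system(("docker inspect " + name).c_str());
  });

  return (inspect.get() == 0 && run.get() == 0) ? 0 : 1;
}
{code}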
I am not sure if this one-time initial delay is the most appropriate fix, or
if it would be better to add a timeout to the inspect call in the
mesos-docker-executor, which would destroy the current inspect thread and start
a new one. The timeout/retry may be appropriate whether or not the initial
delay exists.
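A sketch of what that timeout/retry could look like, again using the CLI-based
model rather than the executor's actual future-based code; the five-second
deadline is an arbitrary illustrative value, and a hung attempt is simply
abandoned on a detached thread rather than destroyed:

{code}
#include <chrono>
#include <cstdlib>
#include <future>
#include <memory>
#include <string>
#include <thread>

// One inspect attempt on a detached thread, so that a hung "docker inspect"
// does not block the caller once the attempt has been abandoned.
std::future<int> inspectOnce(const std::string& name)
{
  auto result = std::make_shared<std::promise<int>>();
  std::future<int> future = result->get_future();
  std::thread([result, name]() {
    result->set_value(std::system(("docker inspect " + name).c_str()));
  }).detach();
  return future;
}

// Retry the inspect until it succeeds, giving each attempt its own deadline.
int inspectWithRetry(const std::string& name)
{
  while (true) {
    std::future<int> attempt = inspectOnce(name);

    // If the attempt returns within the deadline and succeeds, the task
    // could be flagged RUNNING at this point.
    if (attempt.wait_for(std::chrono::seconds(5)) ==
            std::future_status::ready &&
        attempt.get() == 0) {
      return 0;
    }

    // Otherwise the attempt either hung past the deadline or failed (for
    // example because the container doesn't exist yet); wait 500ms and
    // start a fresh one.
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  }
}

int main()
{
  return inspectWithRetry("inspect-race-test");  // illustrative name
}
{code}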
In Summary:
It appears that the mesos-docker-executor does not have a race condition
itself, but it seems to be triggering one in docker. Since we haven't found
this issue reported anywhere else with any substance, we understand that it is
likely related to our environment. Our custom network driver for docker does
some cluster-wide coordination, and may introduce just enough delay between the
"create" and "attach" calls to cause us to hit this bug on roughly 50-60% of
attempted container starts.
The inspectDelay patch that I've written for this issue is located in my
inspectDelay branch at:
https://github.com/travishegner/mesos/tree/inspectDelay
I am happy to supply this patch as a pull request, or to put it through the
review board, if the maintainers feel it is an appropriate fix, or at least an
acceptable stop-gap measure until a better fix can be written.