[
https://issues.apache.org/jira/browse/MESOS-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391439#comment-14391439
]
Timothy Chen commented on MESOS-2583:
-------------------------------------
I think I understand what's going on. This most likely affects only the
Docker containerizer.
When a task is launched detached and fails before we launch the executor, the
subsequent call to update its resources fails: the Docker container is no
longer running when we look up its pid, and the cgroups path cannot be found
because it has already been removed.
However, the executor that was launched to run 'docker wait container-id' is
still waiting for a RunTaskMessage before it starts docker-wait, so it just
sits there. Meanwhile, when the slave cannot update the containerizer, it
simply calls destroy on the containerizer and trusts that the executor will
clean itself up.
I think the fix is probably twofold:
- We shouldn't fail the update if the Docker container has exited, which means
we should not just return Failure. Instead, when the cgroups update call fails,
we could perform an extra os::exists check to verify that the pid has exited,
and return Nothing() if it no longer exists.
- The executor that the Docker containerizer launched should be removed by
containerizer->destroy so that we don't keep idle executors around. This
should be fixed in the future when we move docker->run inside the executor,
so that it removes itself when the container dies.
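
A rough sketch of the first point, to make the intended control flow concrete.
This is not the actual containerizer code: UpdateResult, pidExists,
updateContainer and the applyCgroupsUpdate callback are hypothetical stand-ins
for Nothing()/Failure, the os::exists check and the real cgroups update call.
The liveness check assumes a POSIX system and uses kill() with signal 0.

```cpp
#include <sys/types.h>
#include <signal.h>

#include <cerrno>
#include <functional>
#include <string>

// Hypothetical result type standing in for the Nothing()/Failure return values.
struct UpdateResult {
  bool ok;
  std::string error;
};

// Returns true if a process with this pid still exists. Signal 0 performs
// only the existence and permission checks; no signal is delivered.
bool pidExists(pid_t pid) {
  if (pid <= 0) return false;            // Never probe process groups or "all".
  if (::kill(pid, 0) == 0) return true;  // Process exists and is signalable.
  return errno == EPERM;                 // Exists, but owned by another user.
}

// Hypothetical update path: `applyCgroupsUpdate` stands in for the real
// cgroups update and returns false when that update fails.
UpdateResult updateContainer(
    pid_t pid, const std::function<bool()>& applyCgroupsUpdate) {
  if (applyCgroupsUpdate()) {
    return UpdateResult{true, ""};
  }

  // The update failed. If the container's pid is already gone, there is
  // nothing left to update, so report success instead of a failure.
  if (!pidExists(pid)) {
    return UpdateResult{true, ""};
  }

  return UpdateResult{false, "cgroups update failed for a running container"};
}
```

With this shape, a failed update for a container that has already exited no
longer triggers the destroy path in the slave, while a failed update for a
container that is still running is reported as before.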
> Tasks getting stuck in staging
> ------------------------------
>
> Key: MESOS-2583
> URL: https://issues.apache.org/jira/browse/MESOS-2583
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.22.0
> Reporter: Brenden Matthews
> Attachments:
> Justin-Bieber_The-Beliebers-Want-to-Believe-2-650x406.jpg, Screen Shot
> 2015-03-26 at 11.59.33 AM.png, Screen Shot 2015-03-30 at 2.04.14 PM.png,
> log.txt
>
>
> Tasks occasionally become stuck in the `TASK_STAGING` state after launching.
> It appears that this affects both Docker and non-Docker tasks, especially
> those which start up and fail immediately. Attached is a sample of the slave
> log as well as screenshots from a testing cluster showing the tasks which are
> stuck in staging, and then a number of failed tasks which occurs after
> restarting the slave process. Justin Bieber is provided for scale.
> This may be related to MESOS-1837, and quite possibly the same issue, but it
> remains unclear.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)