[jira] [Commented] (MESOS-2583) Tasks getting stuck in staging

Timothy Chen (JIRA) Wed, 01 Apr 2015 13:54:13 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391439#comment-14391439
 ]


Timothy Chen commented on MESOS-2583:
-------------------------------------

I think I understand what's going on, and it most likely affects only the 
Docker containerizer.
When a task is launched detached and fails before we launch the executor, the 
subsequent call to update its resources fails: the docker container is no 
longer running when we try to find the pid, and the cgroups path can't be 
found because it has already been removed.
However, the executor that was launched to run 'docker wait container-id' is 
still waiting for a RunTaskMessage before it starts docker wait, so it just 
sits there. Meanwhile, in the slave, if we cannot update the containerizer we 
simply call destroy on the containerizer and trust that the executor will 
clean itself up.
I think the fix for this is probably twofold:
- We shouldn't fail the update if the docker container has exited, which means 
we should not just return Failure. What we could do is perform an extra 
os::exists check when the cgroups update call fails, to verify whether the 
pid has exited, and if it no longer exists return Nothing() instead.
- The executor that the Docker containerizer launched should get removed by 
containerizer->destroy to ensure we don't keep idle executors around. This 
should be fixed in the future when we move docker->run right inside of the 
executor, so that it removes itself when the container dies.
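
A rough sketch of the first point, to make the idea concrete. This is not the 
actual Mesos code: UpdateResult, processExists and updateOrIgnoreExited are 
hypothetical names, UpdateResult stands in for the Failure / Nothing() 
outcomes, and processExists is a stand-in for the os::exists check that probes 
the pid with signal 0.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <string>

#include <signal.h>
#include <sys/types.h>

// Hypothetical result type: ok == true plays the role of Nothing(),
// ok == false plays the role of Failure(error).
struct UpdateResult {
  bool ok;
  std::string error;
};

// Stand-in for the os::exists check mentioned above (an assumption, not
// the Mesos implementation). Signal 0 delivers nothing; it only tests
// whether the pid can be signalled. EPERM means the process exists but
// belongs to another user.
inline bool processExists(pid_t pid) {
  assert(pid > 0);  // 0 and negative values address process groups.
  return ::kill(pid, 0) == 0 || errno == EPERM;
}

// Runs the (hypothetical) cgroups update. If it fails and the container's
// pid is already gone, the failure is swallowed, because there is nothing
// left to update. If the pid is still alive, the failure is propagated.
inline UpdateResult updateOrIgnoreExited(
    pid_t pid,
    const std::function<UpdateResult()>& cgroupsUpdate,
    const std::function<bool(pid_t)>& exists = processExists) {
  UpdateResult result = cgroupsUpdate();
  if (result.ok) {
    return result;
  }
  if (!exists(pid)) {
    return UpdateResult{true, ""};  // Container exited: not an error.
  }
  return result;  // Genuine failure: the process is still running.
}
```

The existence check is passed in as a parameter only so the sketch can be 
exercised without a real container; in the slave it would be called directly.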


> Tasks getting stuck in staging
> ------------------------------
>
>                 Key: MESOS-2583
>                 URL: https://issues.apache.org/jira/browse/MESOS-2583
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0
>            Reporter: Brenden Matthews
>         Attachments: 
> Justin-Bieber_The-Beliebers-Want-to-Believe-2-650x406.jpg, Screen Shot 
> 2015-03-26 at 11.59.33 AM.png, Screen Shot 2015-03-30 at 2.04.14 PM.png, 
> log.txt
>
>
> Tasks occasionally become stuck in the `TASK_STAGING` state after launching. 
> It appears that this affects both Docker and non-Docker tasks, especially 
> those which start up and fail immediately. Attached is a sample of the slave 
> log as well as screenshots from a testing cluster showing the tasks which are 
> stuck in staging, and then a number of failed tasks which occur after 
> restarting the slave process. Justin Bieber is provided for scale.
> This may be related to MESOS-1837, and quite possibly the same issue, but it 
> remains unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
