[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595136#comment-16595136
 ] 

Justin Pinkul commented on MESOS-8038:
--------------------------------------

In our GPU cluster we have seen many cases where a task acquires resources on a 
GPU and then gets stuck in the D (uninterruptible sleep) state forever. A 
process stuck in the D state is generally the result of a bug in the GPU driver 
or an NFS driver. When this kind of driver bug is hit, Linux has no way to 
recover, and the only way to kill the process is to restart the machine.
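
As an illustration only, here is a minimal Python sketch of how D-state 
processes might be spotted on a Linux host by reading /proc/<pid>/stat. The 
function names parse_state and find_d_state_pids are hypothetical, and this is 
not the detector used in the cluster described above.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' in the line.
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def find_d_state_pids(proc_root="/proc"):
    """Return the PIDs that are in state D at the moment of sampling."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                if parse_state(f.read()) == "D":
                    stuck.append(int(entry))
        except (OSError, IndexError):
            # The process exited while we were scanning, or the line
            # could not be parsed; skip it.
            continue
    return stuck


if __name__ == "__main__":
    print(find_d_state_pids())
```

A single sample only shows that a process is in the D state right now, which 
is normal during ordinary I/O. A real detector would presumably sample 
repeatedly and flag only PIDs that stay in D for a long time.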

We handle these issues by automatically detecting them and putting the machine 
into maintenance mode with a start time of now and an end time of one year from 
now. This keeps the machine from failing further tasks until our operations 
team has a chance to investigate what caused the task to get stuck in the D 
state.
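
For reference, a maintenance window that starts now and lasts one year could be 
described with a payload like the one built below. This is a sketch that 
assumes the JSON layout of the Mesos maintenance schedule (windows, machine_ids, 
unavailability with start and duration in nanoseconds). The helper name 
build_schedule and the example hostname and IP are hypothetical.

```python
import json
import time

NANOS_PER_SECOND = 1_000_000_000
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def build_schedule(hostname, ip, start_ns=None, duration_s=ONE_YEAR_SECONDS):
    """Build a one-window maintenance schedule for a single machine."""
    if start_ns is None:
        start_ns = time.time_ns()  # "start time of now"
    return {
        "windows": [
            {
                "machine_ids": [{"hostname": hostname, "ip": ip}],
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_s * NANOS_PER_SECOND},
                },
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical agent; replace with the machine that has the stuck task.
    schedule = build_schedule("gpu-agent-1.example.com", "10.0.0.1")
    print(json.dumps(schedule, indent=2))
```

The resulting JSON would be sent to the master's maintenance schedule endpoint. 
As far as I know, posting a schedule replaces the existing one as a whole, so a 
real tool would need to fetch the current schedule and merge the new window 
into it rather than post this payload on its own.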

I think the only graceful way Mesos could handle this state is to offer fewer 
GPUs until the machine is restarted.

> Launching GPU task sporadically fails.
> --------------------------------------
>
>                 Key: MESOS-8038
>                 URL: https://issues.apache.org/jira/browse/MESOS-8038
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, containerization, gpu
>    Affects Versions: 1.4.0
>            Reporter: Sai Teja Ranuva
>            Assignee: Zhitao Li
>            Priority: Critical
>         Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
