[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101162#comment-17101162
 ] 

Charles Natali commented on MESOS-8038:
---------------------------------------

The more I think about it, the more I believe that the current behavior of 
optimistically releasing the resources is very sub-optimal.

We've had cgroup destruction fail for various reasons in our cluster:
 * kernel bugs - see https://issues.apache.org/jira/browse/MESOS-10107
 * tasks stuck in uninterruptible sleep, e.g. NFS I/O (see the diagnostic sketch below)
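
For the second case, a quick way to confirm it is to look for processes in the 
container's cgroup sitting in uninterruptible sleep ("D" state), since the kernel 
cannot tear the cgroup down while such processes exist. A minimal diagnostic 
sketch - not Mesos code, and the cgroup path below is only a placeholder for 
whatever hierarchy the agent actually uses:

{code:python}
#!/usr/bin/env python3
"""List processes in a cgroup that are stuck in uninterruptible sleep."""

import sys
from pathlib import Path


def stuck_pids(cgroup_dir):
    """Yield (pid, comm) for D-state processes listed in cgroup.procs."""
    procs = Path(cgroup_dir) / "cgroup.procs"
    for pid in procs.read_text().split():
        try:
            stat = Path("/proc", pid, "stat").read_text()
        except FileNotFoundError:
            continue  # the process exited between listing and reading
        # comm is wrapped in parentheses and may contain spaces, so the
        # state field is the first token after the last ')'.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 2:].split()[0]
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    # Placeholder path: point this at the container's actual cgroup.
    cgroup = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/freezer/mesos"
    for pid, comm in stuck_pids(cgroup):
        print("pid %s (%s) is in uninterruptible sleep" % (pid, comm))
{code}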

When this happens, it triggers at least the following problems:
 * this issue with GPUs, which causes all subsequent tasks scheduled on the host 
that try to use the GPU to fail, effectively turning the host into a black hole
 * another problem where tasks stuck in uninterruptible sleep were still 
consuming memory, so the agent overcommitted memory, causing tasks to run out 
of memory further down the line

 

"Leaking" CPU is mostly fine because it's a compressible resource and stuck 
tasks generally don't use it, but leaking memory or GPUs is pretty bad, causing 
errors which are hard to diagnose and to recover from automatically.
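
To make the alternative concrete - this is only a toy sketch, not the agent's 
actual code, and names like ResourceTracker or confirm_cleanup are made up: the 
resources of a terminated container would stay accounted as used until cgroup 
destruction actually succeeds, so a new launch fails fast at allocation time 
instead of the task later dying with "Collect failed: Requested 1 but only 0 
available".

{code:python}
"""Toy model of deferred resource release (illustration only, not Mesos code)."""


class ResourceTracker:
    def __init__(self, total_gpus):
        self.total_gpus = total_gpus
        self.allocated = {}        # container id -> GPUs held by running containers
        self.pending_cleanup = {}  # containers killed but whose cgroup still exists

    def available(self):
        # GPUs awaiting cleanup are still counted as busy.
        return (self.total_gpus - sum(self.allocated.values())
                - sum(self.pending_cleanup.values()))

    def allocate(self, container_id, gpus):
        if gpus > self.available():
            raise RuntimeError(
                "Requested %d but only %d available" % (gpus, self.available()))
        self.allocated[container_id] = gpus

    def kill(self, container_id):
        # Deferred release: keep the GPUs accounted until cleanup is confirmed.
        self.pending_cleanup[container_id] = self.allocated.pop(container_id)

    def confirm_cleanup(self, container_id):
        # Only once the cgroup is actually destroyed do the GPUs become free.
        self.pending_cleanup.pop(container_id, None)


if __name__ == "__main__":
    tracker = ResourceTracker(total_gpus=1)
    tracker.allocate("container-1", 1)
    tracker.kill("container-1")             # cgroup destruction still pending
    try:
        tracker.allocate("container-2", 1)  # fails fast instead of at task launch
    except RuntimeError as e:
        print(e)                            # Requested 1 but only 0 available
{code}

With optimistic release the second allocate would succeed and the task would 
only fail later on the host, which is exactly the black hole behaviour described 
above.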


> Launching GPU task sporadically fails.
> --------------------------------------
>
>                 Key: MESOS-8038
>                 URL: https://issues.apache.org/jira/browse/MESOS-8038
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, gpu
>    Affects Versions: 1.4.0
>            Reporter: Sai Teja Ranuva
>            Assignee: Zhitao Li
>            Priority: Critical
>         Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. It 
> happens even before the job starts. A little search in the code base points me 
> to something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
