[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360613#comment-17360613 ]

Gregoire Seux commented on MESOS-8038:
--------------------------------------

We encountered this issue (with cpus) while working on an isolator dealing with 
cpusets (which are not a compressible resource): Mesos accepts tasks for 
resources before the tasks previously using those resources have been fully 
cleaned up (i.e., before the cleanup of every isolator has run). This is true 
for both cpus & gpus. We worked around it in our isolator code with a 
heuristic: a cpuset cgroup with no processes is most likely a to-be-cleaned 
cpuset and can be reused.
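
For illustration, a minimal sketch of that kind of heuristic (the paths and 
names below are hypothetical, not the actual isolator code; a real isolator 
would go through the agent's cgroups helpers): a cpuset cgroup whose 
cgroup.procs file is empty is treated as left over from a terminated container 
whose cleanup has not finished yet, so its cpus are considered reclaimable.

// Hypothetical sketch of the "empty cpuset cgroup can be reused" heuristic.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Returns true if the cpuset cgroup at 'cgroupPath' has no attached
// processes, i.e. its cgroup.procs file is empty. Such a cgroup is most
// likely left over from a terminated container whose isolator cleanup has
// not run yet, so its cpus can be treated as reclaimable.
bool isLikelyStaleCpusetCgroup(const fs::path& cgroupPath)
{
  std::ifstream procs(cgroupPath / "cgroup.procs");
  if (!procs.is_open()) {
    return false; // Cgroup missing or unreadable: do not reuse.
  }

  std::string pid;
  return !(procs >> pid); // No PID could be read => no attached processes.
}

int main(int argc, char** argv)
{
  // Example: scan per-container cgroups under a (hypothetical) cpuset root.
  const fs::path root = argc > 1 ? argv[1] : "/sys/fs/cgroup/cpuset/mesos";

  if (!fs::exists(root) || !fs::is_directory(root)) {
    std::cerr << "No such cgroup hierarchy: " << root << std::endl;
    return 1;
  }

  for (const auto& entry : fs::directory_iterator(root)) {
    if (entry.is_directory()) {
      std::cout << entry.path().filename().string() << ": "
                << (isLikelyStaleCpusetCgroup(entry.path())
                      ? "no processes, cpus likely reusable"
                      : "in use")
                << std::endl;
    }
  }
}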

There is probably a flaw in the synchronization of operations in the Mesos 
agent.
It could be addressed at the general level (but this is hard) or worked 
around in most isolators (not very complex, but it depends on each isolator).

> Launching GPU task sporadically fails.
> --------------------------------------
>
>                 Key: MESOS-8038
>                 URL: https://issues.apache.org/jira/browse/MESOS-8038
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, gpu
>    Affects Versions: 1.4.0
>            Reporter: Sai Teja Ranuva
>            Assignee: Zhitao Li
>            Priority: Critical
>         Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by the executor getting killed and the tasks getting lost. This 
> happens even before the job starts. A little search in the code base points 
> me to something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.


