[
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360613#comment-17360613
]
Gregoire Seux commented on MESOS-8038:
--------------------------------------
We encountered this issue (with cpus) when working on an isolator dealing with
cpusets (which are not a compressible resource): Mesos accepts tasks for
resources before having cleaned up (i.e. run the cleanup of every isolator for)
the tasks that were using those resources. This is true for both cpus and gpus.
We worked around this in the isolator code with a heuristic: a cpuset cgroup
with no processes is likely a to-be-cleaned cpuset and can be reused.
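For illustration, the empty-cgroup heuristic might look roughly like the sketch
below. This is a minimal illustration, not the actual isolator code; the
function name and the cgroup v1 cpuset mount path are assumptions.

#include <fstream>
#include <string>

// Returns true if the cpuset cgroup at `cgroup` has no attached processes,
// i.e. it is likely left over from a task whose isolator cleanup has not
// finished yet, so its CPUs can be handed to a new task.
// Assumes the cgroup v1 cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset.
bool isReusableCpuset(const std::string& cgroup)
{
  const std::string procs =
    "/sys/fs/cgroup/cpuset/" + cgroup + "/cgroup.procs";

  std::ifstream in(procs);
  if (!in.is_open()) {
    // The cgroup no longer exists; nothing to reuse.
    return false;
  }

  std::string pid;
  // Any PID listed means a process is still attached: not reusable yet.
  return !(in >> pid);
}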
There is probably a defect in the synchronization of operations in the Mesos
agent.
It can probably be addressed at the general level (but this is hard) or worked
around in most isolators (not very complex, but it depends on each isolator).
> Launching GPU task sporadically fails.
> --------------------------------------
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
> Issue Type: Bug
> Components: containerization, gpu
> Affects Versions: 1.4.0
> Reporter: Sai Teja Ranuva
> Assignee: Zhitao Li
> Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt,
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time.
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. This happens
> even before the job starts. A little search in the code base points me to
> something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.