[ https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212793#comment-17212793 ]
Qian Zhang commented on MESOS-10192:
------------------------------------

commit 301902be4f1332799cf3b3242cd29b4907c21c09
Author: Qian Zhang
Date:   Sat Oct 10 15:04:57 2020 +0800

    Ignored the directory `/dev/nvidia-caps` when globbing Nvidia GPU devices.

    The directory `/dev/nvidia-caps` was introduced in CUDA 11.0; just ignore
    it since we only care about the Nvidia GPU device files.

    Review: https://reviews.apache.org/r/72945

> Recent Nvidia CUDA changes break Mesos GPU support
> --------------------------------------------------
>
>                 Key: MESOS-10192
>                 URL: https://issues.apache.org/jira/browse/MESOS-10192
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, gpu
>            Reporter: Greg Mann
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: GPU, containerization, containerizer, gpu
>
> Recently it seems that the layout of the Nvidia device files has changed:
> https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
> This prevents GPU tasks from launching:
> {noformat}
> W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container
> c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160:
> Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a
> special file: /dev/nvidia-caps
> {noformat}
> due to this code, which detects the Nvidia device files:
> https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
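
For readers following the failure above: a glob over the Nvidia device paths now also matches the directory `/dev/nvidia-caps`, which is not a special file, so copying it into the container fails. The sketch below is not the code from the commit (which, per its message, just ignores that directory). It illustrates one possible alternative using plain POSIX calls: keep only glob matches that are character devices. The helper names `isCharDevice` and `globCharDevices` are hypothetical and do not exist in Mesos.

```cpp
#include <glob.h>
#include <sys/stat.h>

#include <string>
#include <vector>

// Hypothetical helper: true only if `path` exists and is a character
// device, which is the file type GPU device nodes are expected to have.
bool isCharDevice(const std::string& path)
{
  struct stat s;
  if (::stat(path.c_str(), &s) != 0) {
    return false;  // Missing or inaccessible paths are not devices.
  }
  return S_ISCHR(s.st_mode);
}

// Hypothetical helper: expands `pattern` (e.g. "/dev/nvidia*") and
// returns only the matches that are character devices, so a directory
// such as `/dev/nvidia-caps` is silently skipped.
std::vector<std::string> globCharDevices(const std::string& pattern)
{
  std::vector<std::string> devices;

  glob_t g{};  // Zero-initialized so globfree() is safe on any outcome.
  if (::glob(pattern.c_str(), 0, nullptr, &g) == 0) {
    for (size_t i = 0; i < g.gl_pathc; ++i) {
      const std::string path = g.gl_pathv[i];
      if (isCharDevice(path)) {
        devices.push_back(path);
      }
    }
  }
  ::globfree(&g);

  return devices;
}
```

Under these assumptions, `globCharDevices("/dev/nvidia*")` on a CUDA 11.0 host would return the device nodes and omit `/dev/nvidia-caps`. Filtering by file type rather than by name would also skip any other non-device entries that might appear later, at the cost of one extra `stat` call per match.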