Greg Mann created MESOS-10192:
---------------------------------

             Summary: Recent Nvidia CUDA changes break Mesos GPU support
                 Key: MESOS-10192
                 URL: https://issues.apache.org/jira/browse/MESOS-10192
             Project: Mesos
          Issue Type: Bug
          Components: agent, containerization, gpu
            Reporter: Greg Mann


The layout of the Nvidia device files seems to have changed recently; see the MIG user guide:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

This prevents GPU tasks from launching:
{noformat}
W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container 
c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: 
Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a 
special file: /dev/nvidia-caps
{noformat}

The failure is due to this code, which detects the Nvidia device files and evidently treats '/dev/nvidia-caps' as one of them:
https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454
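
One possible direction, shown only as a rough sketch and not as the actual Mesos code or a proposed patch: skip any matched path that is not a device special file before trying to copy it. The helper names isDeviceNode and globDeviceNodes are hypothetical, and the sketch uses plain POSIX glob() and stat() rather than the stout helpers the isolator uses. It also assumes that /dev/nvidia-caps is a directory on affected hosts, which the "Not a special file" error suggests but the log does not confirm.

```cpp
#include <glob.h>
#include <sys/stat.h>

#include <string>
#include <vector>

// Hypothetical helper: returns true only if `path` exists and is a
// character or block device node. Directories and regular files
// (and paths that cannot be stat'ed) return false.
bool isDeviceNode(const std::string& path)
{
  struct stat s;
  if (::stat(path.c_str(), &s) != 0) {
    return false;
  }
  return S_ISCHR(s.st_mode) || S_ISBLK(s.st_mode);
}

// Hypothetical helper: expands `pattern` (e.g. "/dev/nvidia*") and
// keeps only the matches that are device nodes, so that an entry
// like /dev/nvidia-caps is skipped if it is not a special file.
std::vector<std::string> globDeviceNodes(const std::string& pattern)
{
  std::vector<std::string> devices;

  glob_t g = {};
  if (::glob(pattern.c_str(), 0, nullptr, &g) == 0) {
    for (size_t i = 0; i < g.gl_pathc; ++i) {
      const std::string path = g.gl_pathv[i];
      if (isDeviceNode(path)) {
        devices.push_back(path);
      }
    }
  }
  globfree(&g);

  return devices;
}
```

Filtering like this would only avoid the launch failure. Whether the device files under /dev/nvidia-caps also need to be exposed to GPU tasks is a separate question that this sketch does not address.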



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
