[ https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606913#comment-15606913 ]
Kevin Klues commented on MESOS-6383:
------------------------------------

I think we need to loop some of the nvidia guys in on this. [~vditya], [~rph...@nvidia.com], [~rtodd], [~exxo] can one of you comment on this?

> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6383
>                 URL: https://issues.apache.org/jira/browse/MESOS-6383
>             Project: Mesos
>          Issue Type: Improvement
>   Affects Versions: 1.0.1
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>              Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a position to upgrade the Nvidia drivers in the near future, and are currently at driver version 319.72.
>
> When attempting to launch an agent with the following command and take advantage of Nvidia GPU support (master address elided):
>
> bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
>
> I receive the following error message:
>
> bq. {{Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
>
> Based on the change log for the NVML module, it seems that {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and later as per info under the [Changes between NVML v5.319 Update and v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading in the NVML API reference.
>
> Is there an alternate method of obtaining this information at runtime to enable support for older versions of the Nvidia driver? Based on discussion in the design document, obtaining this information from the {{nvidia-smi}} command output is a feasible alternative.
>
> I am willing to submit a PR that amends the behaviour of {{NvidiaGpuAllocator}} such that it first attempts to call {{nvmlDeviceGetMinorNumber}} via libnvidia-ml and, if the symbol cannot be found, falls back on {{nvidia-smi}}: it uses the path given by a {{--nvidia-smi="/path/to/nvidia-smi"}} mesos-agent option if provided, otherwise attempts to run {{nvidia-smi}} if found on the path, and parses the output to obtain this information. If none of these succeed, it raises an exception indicating that all of this was attempted.
>
> Would a function or class for parsing {{nvidia-smi}} output be a useful contribution?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
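
The fallback order proposed in the issue above (NVML first, then an explicitly configured {{nvidia-smi}}, then {{nvidia-smi}} from the path, then an error naming everything that was tried) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Mesos code: {{MinorNumberSource}}, {{parseMinorNumber}} and {{resolveMinorNumber}} are hypothetical names, and the {{Minor Number : <N>}} line format is an assumption about {{nvidia-smi -q}} output that would need to be checked against the 319.72 driver. It requires C++17.

```cpp
#include <functional>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical: one named strategy for discovering a GPU's device minor
// number. `query` returns std::nullopt when the strategy is unavailable
// (e.g. the NVML symbol is missing, or no nvidia-smi binary was found).
struct MinorNumberSource {
  std::string name;
  std::function<std::optional<unsigned int>()> query;
};

// Hypothetical helper: returns the value from the first line of the form
// "Minor Number : <N>". ASSUMPTION: this line format is what
// `nvidia-smi -q` prints; verify it on the target driver version.
std::optional<unsigned int> parseMinorNumber(const std::string& output) {
  std::istringstream lines(output);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.find("Minor Number") == std::string::npos) continue;
    const auto colon = line.find(':');
    if (colon == std::string::npos) continue;
    try {
      return static_cast<unsigned int>(std::stoul(line.substr(colon + 1)));
    } catch (const std::exception&) {
      continue;  // Value was not numeric (e.g. "N/A"); keep scanning.
    }
  }
  return std::nullopt;
}

// Tries each source in order and returns the first minor number found.
// Throws an error listing every source that was attempted if none succeed.
unsigned int resolveMinorNumber(const std::vector<MinorNumberSource>& sources) {
  std::string attempted;
  for (const auto& source : sources) {
    if (const auto minor = source.query()) return *minor;
    attempted += (attempted.empty() ? "" : ", ") + source.name;
  }
  throw std::runtime_error(
      "Could not determine device minor number; tried: " + attempted);
}
```

In such a design, the first source would wrap the NVML symbol lookup and return {{std::nullopt}} when the symbol is absent, while the later sources would run {{nvidia-smi}} and hand its output to the parser. Keeping the strategies in an ordered list means the final error message can name every source that was attempted.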