[ https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609351#comment-15609351 ]
Joseph Wu commented on MESOS-6383:
----------------------------------
Sounds like this is going to be a documentation "fix", i.e.:
https://reviews.apache.org/r/53201/
> NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber -
> can the device minor number be ascertained reliably using an older set of API
> calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-6383
> URL: https://issues.apache.org/jira/browse/MESOS-6383
> Project: Mesos
> Issue Type: Improvement
> Affects Versions: 1.0.1
> Reporter: Dylan Bethune-Waddell
> Assignee: Kevin Klues
> Priority: Minor
> Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We
> are not in a position to upgrade the Nvidia drivers in the near future, and
> are currently at driver version 319.72.
> When attempting to launch an agent with the following command to take
> advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort>
> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to
> NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load
> symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol
> 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' :
> /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that
> {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and
> later as per info under the [Changes between NVML v5.319 Update and
> v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log]
> heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to
> enable support for older versions of the Nvidia driver? Based on discussion
> in the design document, obtaining this information from the {{nvidia-smi}}
> command output is a feasible alternative.
> I am willing to submit a PR that amends the behaviour of
> {{NvidiaGpuAllocator}} so that it first attempts to call
> {{nvmlDeviceGetMinorNumber}} via libnvidia-ml. If the symbol cannot be found,
> it would fall back on the path given by a
> {{--nvidia-smi="/path/to/nvidia-smi"}} agent option, if provided, or
> otherwise attempt to run {{nvidia-smi}} if it is found on the path, and parse
> the output to obtain this information. If all of these fail, it would raise
> an error indicating everything that was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful
> contribution?
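
The fallback proposed above could look roughly like the sketch below. It is
Python purely to illustrate the control flow; it is not Mesos code (which is
C++) and not the approach taken in the linked review. All function names are
hypothetical, the {{--nvidia-smi}} path is the reporter's proposed option
rather than an existing agent flag, and the parser assumes that
{{nvidia-smi -q}} prints one "Minor Number : N" line per GPU, which has not
been verified for driver 319.72.

```python
import ctypes
import re
import shutil
import subprocess

NVML_LIBRARY = "libnvidia-ml.so.1"
NVML_SYMBOL = "nvmlDeviceGetMinorNumber"


def nvml_has_symbol(library=NVML_LIBRARY, symbol=NVML_SYMBOL, loader=ctypes.CDLL):
    """Return True if `library` can be loaded and exports `symbol`."""
    try:
        handle = loader(library)
    except OSError:
        return False  # library missing or not loadable
    # ctypes looks symbols up on attribute access; a missing symbol raises
    # AttributeError, which hasattr() reports as False.
    return hasattr(handle, symbol)


# Assumed line format: "    Minor Number                    : 0"
MINOR_RE = re.compile(r"^[ \t]*Minor Number[ \t]*:[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def parse_minor_numbers(smi_output):
    """Extract minor numbers, in order of appearance, from nvidia-smi text."""
    return [int(value) for value in MINOR_RE.findall(smi_output)]


def minor_numbers_via_smi(nvidia_smi=None):
    """Run nvidia-smi (explicit path first, then PATH) and parse its output."""
    binary = nvidia_smi or shutil.which("nvidia-smi")
    if binary is None:
        raise RuntimeError("nvidia-smi was not given and was not found on PATH")
    result = subprocess.run(
        [binary, "-q"], capture_output=True, text=True, check=True
    )
    minors = parse_minor_numbers(result.stdout)
    if not minors:
        raise RuntimeError("no 'Minor Number' lines found in nvidia-smi output")
    return minors
```

A caller would check nvml_has_symbol() first and only call
minor_numbers_via_smi() when it returns False, reporting both failures if the
fallback raises as well.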
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)