Kevin Klues commented on MESOS-6383:

Hi Dylan,

Thanks for reporting this.  I see the problem, but it's not immediately clear 
to me what the solution would be.  We don't want to just parse the output of 
{{nvidia-smi}} because that also changes from version to version (we talked 
with Nvidia directly about this, and they *highly* discouraged trying to rely 
on the output of {{nvidia-smi}}).

One thing I could imagine doing is to change the code that attempts to load the 
{{nvmlDeviceGetMinorNumber}} symbol from NVML.  It could attempt to load the 
symbol and, if that fails, fall back to implementing our wrapper function 
{{nvml::deviceGetMinorNumber()}} using a different method (meaning there would 
be no changes to {{NvidiaGpuAllocator}}).  Do you know what (if any) methods 
were available in the 5.319 driver to determine the minor number?  How does 
the old {{nvidia-smi}} determine them?
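
To make the idea concrete, here is a rough sketch of the "try the symbol, 
otherwise fall back" pattern.  It is not the actual Mesos code: the names 
{{SymbolResolver}}, {{NvmlSymbols}}, {{loadSymbols}} and 
{{fallbackMinorNumber}} are made up for illustration, the function signature 
is simplified (the real NVML call takes an {{nvmlDevice_t}} and returns an 
{{nvmlReturn_t}}), and the fallback is only a stub, since how to obtain the 
minor number on an old driver is still the open question.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Simplified stand-in for the NVML signature (assumption for this sketch):
// the real function takes an nvmlDevice_t and returns an nvmlReturn_t.
using MinorNumberFn = int (*)(void* device, unsigned int* minor);

// Hypothetical resolver: returns the address of a symbol, or nullptr when the
// loaded library does not export it. In real code this would wrap a dlsym()
// lookup on the handle for libnvidia-ml.so.1.
using SymbolResolver = std::function<void*(const std::string&)>;

// Hypothetical fallback used when the driver is too old to export the symbol.
// It only reports failure; a real fallback would need some other source for
// the minor number.
inline int fallbackMinorNumber(void*, unsigned int*) { return -1; }

struct NvmlSymbols {
  MinorNumberFn deviceGetMinorNumber;  // Native symbol or the fallback.
  bool usingFallback;                  // True when the symbol was missing.
};

// Treats the symbol as optional: a failed lookup selects the fallback
// instead of aborting initialization.
inline NvmlSymbols loadSymbols(const SymbolResolver& resolve) {
  assert(static_cast<bool>(resolve));  // An empty resolver is a caller bug.

  void* address = resolve("nvmlDeviceGetMinorNumber");
  if (address == nullptr) {
    return {&fallbackMinorNumber, true};
  }

  // Converting void* to a function pointer is the usual dlsym() idiom on
  // POSIX platforms.
  return {reinterpret_cast<MinorNumberFn>(address), false};
}
```

With something like this, the wrapper would call through 
{{deviceGetMinorNumber}} either way, so callers such as 
{{NvidiaGpuAllocator}} would not need to know which path was taken.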

Also, are you sure this is the only symbol we aren't able to load from the old 
driver, or did you just hit this one first?

> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - 
> can the device minor number be ascertained reliably using an older set of API 
> calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: MESOS-6383
>                 URL: https://issues.apache.org/jira/browse/MESOS-6383
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.0.1
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>              Labels: gpu
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We 
> are not in a position to upgrade the Nvidia drivers in the near future, and 
> are currently at driver version 319.72.
> When attempting to launch an agent with the following command and take 
> advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> 
> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to 
> NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load 
> symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 
> 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : 
> /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that 
> {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and 
> later as per info under the [Changes between NVML v5.319 Update and 
> v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] 
> heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to 
> enable support for older versions of the Nvidia driver? Based on discussion 
> in the design document, obtaining this information from the {{nvidia-smi}} 
> command output is a feasible alternative. 
> I am willing to submit a PR that amends the behaviour of 
> {{NvidiaGpuAllocator}} such that it first attempts to call 
> {{nvmlDeviceGetMinorNumber}} via libnvidia-ml and, if the symbol cannot be 
> found, falls back on parsing {{nvidia-smi}} output, using the path from a 
> {{--nvidia-smi="/path/to/nvidia-smi"}} option passed to mesos-agent if 
> provided, or else {{nvidia-smi}} if it is found on the path. If all of these 
> fail, it would raise an exception indicating everything that was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful 
> contribution?

