[ 
https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Bethune-Waddell updated MESOS-6383:
-----------------------------------------
    Description: 
We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We 
are not in a position to upgrade the Nvidia drivers in the near future, and are 
currently at driver version 319.72.

When attempting to launch an agent with the following command to take 
advantage of Nvidia GPU support (master address elided):

bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> 
--work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}

I receive the following error message:

bq. {{Failed to create a containerizer: Failed call to 
NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load 
symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 
'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : 
/usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}

Based on the change log for the NVML module, it seems that 
{{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and 
later as per info under the [Changes between NVML v5.319 Update and 
v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading 
in the NVML API reference.

Is there an alternate method of obtaining this information at runtime to 
enable support for older versions of the Nvidia driver? Based on discussion in 
the design document, obtaining this information from the {{nvidia-smi}} command 
output is a feasible alternative.

I am willing to submit a PR that amends the behaviour of {{NvidiaGpuAllocator}} 
so that it first attempts to call {{nvml::nvmlDeviceGetMinorNumber}} via 
libnvidia-ml; if the symbol cannot be found, it falls back on the 
{{--nvidia-smi="/path/to/nvidia-smi"}} option obtained from mesos-agent if 
provided, or attempts to run {{nvidia-smi}} if found on the path, and parses 
the output to obtain this information. Otherwise, it raises an exception 
indicating that all of this was attempted.
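The fallback above hinges on whether {{nvmlDeviceGetMinorNumber}} resolves at load time. A purely illustrative sketch of that probe (Python/ctypes, not Mesos's actual C++ code path):

```python
import ctypes

def nvml_has_minor_number_symbol(lib_path="libnvidia-ml.so.1"):
    """Probe whether the driver's NVML library exports
    nvmlDeviceGetMinorNumber -- absent on pre-331 drivers,
    which is the condition that would trigger the proposed
    nvidia-smi fallback."""
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return False  # NVML library not present at all
    try:
        lib.nvmlDeviceGetMinorNumber  # attribute lookup fails if the symbol is missing
    except AttributeError:
        return False  # e.g. driver 319.72: symbol absent
    return True
```

On a 319.72 host this probe would return False, and the allocator would proceed to the {{nvidia-smi}} path rather than failing outright.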

Would a function or class for parsing {{nvidia-smi}} output be a useful 
contribution?
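As a rough sketch of what such a parser might look like (hypothetical, in Python; the "Minor Number" field name is an assumption and may not appear in 319-era {{nvidia-smi -q}} output, in which case the ordering of the /dev/nvidiaN device nodes could serve instead):

```python
import re

def parse_minor_numbers(smi_output):
    """Extract per-GPU minor numbers (the N in /dev/nvidiaN)
    from `nvidia-smi -q`-style text output. Assumes lines of
    the form 'Minor Number : N'; the exact field name may
    differ across driver versions."""
    return [int(n) for n in re.findall(r"Minor Number\s*:\s*(\d+)", smi_output)]

# Hypothetical excerpt of `nvidia-smi -q` output for a 2-GPU host:
sample = """\
GPU 0000:05:00.0
    Minor Number                    : 0
GPU 0000:06:00.0
    Minor Number                    : 1
"""
print(parse_minor_numbers(sample))  # → [0, 1]
```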

  was:
We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We 
are not in a position to upgrade the Nvidia drivers in the near future, and are 
currently at driver version 319.72.

When attempting to launch an agent with the following command to take 
advantage of Nvidia GPU support (master address elided):

bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> 
--work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}

I receive the following error message:

bq. {{Failed to create a containerizer: Failed call to 
NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load 
symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 
'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : 
/usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}

Based on the change log for the NVML module, it seems that 
{{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and 
later as per info under the [Changes between NVML v5.319 Update and 
v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading 
in the NVML API reference.

Is there an alternate method of obtaining this information at runtime to 
enable support for older versions of the Nvidia driver? A modest search has not 
yet yielded much insight on a path forward.


> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - 
> can the device minor number be ascertained reliably using an older set of API 
> calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6383
>                 URL: https://issues.apache.org/jira/browse/MESOS-6383
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.0.1
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>              Labels: gpu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
