Dylan Bethune-Waddell created MESOS-6383:
--------------------------------------------
Summary: NvidiaGpuAllocator::resources cannot load symbol
nvmlDeviceGetMinorNumber - can the device minor number be ascertained reliably
using an older set of API calls?
Key: MESOS-6383
URL: https://issues.apache.org/jira/browse/MESOS-6383
Project: Mesos
Issue Type: Improvement
Affects Versions: 1.0.1
Reporter: Dylan Bethune-Waddell
Priority: Minor
We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We
are not in a position to upgrade the Nvidia drivers in the near future, and are
currently at driver version 319.72.
When attempting to launch an agent with Nvidia GPU support using the following
command (master address elided):
bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort>
--work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
I receive the following error message:
bq. {{Failed to create a containerizer: Failed call to
NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load
symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol
'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' :
/usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
Based on the change log for the NVML module, it seems that
{{nvmlDeviceGetMinorNumber}} is only available in driver versions 331 and
later; see the [Changes between NVML v5.319 Update and
v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading
in the NVML API reference.
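To confirm on a given host that the symbol is missing from the installed library rather than the library itself failing to load, something like the following minimal Python sketch may help. The helper name {{has_symbol}} is hypothetical and not part of Mesos or NVML. It relies only on the standard {{ctypes}} module and uses the library name from the error message above.

```python
import ctypes


def has_symbol(library_name, symbol_name):
    """Return True if the shared library loads and exports symbol_name.

    Hypothetical helper for illustration; not part of Mesos or NVML.
    """
    try:
        library = ctypes.CDLL(library_name)
    except OSError:
        # The library is missing or cannot be loaded on this host.
        return False
    # ctypes resolves symbols lazily on attribute access and raises
    # AttributeError when the lookup fails, so hasattr() works as a probe.
    return hasattr(library, symbol_name)


if __name__ == "__main__":
    # Library name taken from the error message in this report.
    for name in ("nvmlInit", "nvmlDeviceGetMinorNumber"):
        print(name, has_symbol("libnvidia-ml.so.1", name))
```

If the change log reading above is right, a host on a pre-331 driver would be expected to report the first symbol as present and the second as absent, but this has not been verified here beyond the agent error shown.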
Is there an alternate method of obtaining this information at runtime to
enable support for older versions of the Nvidia driver? A modest search has not
yet yielded much insight on a path forward.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)