George, Scott and Ethan -

I am the Alliance Manager for cluster management and job scheduling tools at 
NVIDIA.

We recently introduced a new (free) software product for GPU management called 
NVIDIA DCGM (Data Center GPU Manager), which enables:


- Active Health Monitoring
- Active Diagnostics and System Validation
- Policy and Group Configuration Management
- Power and Clock Management

This product supplements NVML and SMI, which you already use in the GPU 
Sensor Monitoring plugin and the nvidia-smi plugin.

More information is available at:
https://developer.nvidia.com/data-center-gpu-manager-dcgm
(Note that you need to be a member of the NVIDIA Developer Zone to download 
the software.)

At minimum, the monitoring components may be of interest to you. Two customer 
account managers asked about Nagios support at a meeting last week.

I support a number of partners in this space by arranging technical calls and 
sometimes providing the latest GPUs for testing and software validation. If I 
can help you support NVIDIA GPUs more productively and fully, we should talk.

Thanks.

      John Coombs
      NVIDIA Tesla Alliance Management
      jcoo...@nvidia.com


