George, Scott and Ethan,

I am the Alliance Manager for cluster management and job scheduling tools at NVIDIA.

We recently introduced a new (free) software product for GPU management called NVIDIA DCGM (Data Center GPU Manager), which enables:

- Active Health Monitoring
- Active Diagnostics and System Validation
- Policy and Group Configuration Management
- Power and Clock Management

This product supplements NVML and SMI, which you already leverage in the GPU Sensor Monitoring plug-in and the nvidia-smi plugin. More information is available at:

https://developer.nvidia.com/data-center-gpu-manager-dcgm

(Note that you need to be a member of the NVIDIA Developer Zone if you want to download the bits.)

At a minimum, the monitoring components may be of interest to you. Two customer account managers asked about Nagios support at a meeting last week.

I support a number of partners in this space by arranging technical calls and sometimes providing the latest GPUs for testing and software validation. If I can help you support NVIDIA GPUs more productively and fully, we should talk.

Thanks.

John Coombs
NVIDIA Tesla Alliance Management
jcoo...@nvidia.com

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------