Dear all:
Just a quick note letting you guys know that we now have a Python
module for monitoring NVIDIA GPUs using the newly released Python
bindings for NVML:
https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
If you are running a cluster with NVIDIA GPUs, please download it. All
metrics that the management library can detect for your GPU are
collected. For more information on which metrics are supported on which
models, please refer to the NVML documentation.
After following the above procedure and restarting the respective
services (gmond and gmetad), I could not get the GPU metrics in Ganglia.
Thanks & Regards,
Hridyesh Kumar
System Engineer
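A quick way to narrow down a report like the one above is to confirm
that NVML works outside of gmond. A minimal check, assuming the
nvidia-ml-py bindings are installed for the same interpreter gmond uses:

    # Sanity check: can this user/interpreter initialise NVML at all?
    # If this fails, the gmond module cannot work either (e.g. driver
    # too old, or bindings installed for a different Python).
    from pynvml import nvmlInit, nvmlShutdown, nvmlDeviceGetCount, NVMLError

    try:
        nvmlInit()
        print("NVML OK, %d GPU(s) visible" % nvmlDeviceGetCount())
        nvmlShutdown()
    except NVMLError as err:
        print("NVML failed: %s" % err)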
I'm trying to implement the instructions given here
http://developer.nvidia.com/ganglia-monitoring-system
on one of our Rocks 5.4.2 clusters that has 2 GPU cards in every
compute node.
Part #1: Python bindings for the NVML
http://pypi.python.org/pypi/nvidia-ml-py/
This requires Python to be newer than 2.4 - following Phil's
instructions in a recent email, I got Python 2.7 and 3.x to install,
and used that to get these Python bindings for NVML to install. I then
followed the instructions on the 'Ganglia/gmond python modules' page:
https://github.com/ganglia/gmond_python_modules
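Since Rocks 5.4.2 ships an older system Python, it is easy to install
the bindings into the wrong interpreter. A quick check, run with the
same interpreter that gmond's python module support uses:

    # Confirms the version requirement and that the NVML bindings
    # (the pynvml module from nvidia-ml-py) import from this interpreter.
    import sys
    print(sys.version)        # needs to be newer than 2.4
    import pynvml
    print("pynvml loaded from %s" % pynvml.__file__)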
Hi Robert,
sFlow is a very simple protocol - an sFlow agent periodically sends
XDR-encoded structures over UDP. Each structure has a tag and a length,
making the protocol extensible.
In the short term, it would make sense to define an sFlow structure to
carry the current NVML metrics and tag ...
Longer term, it would make sense to extend Host sFlow to use the
C-based NVML API to extract and export metrics. This would be
straightforward - the Host sFlow agent uses native C APIs on the
platforms it supports to extract metrics. What would take some thought
is developing standard ...
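To make the "tag + length" framing concrete, here is a toy Python
sketch of how an agent might pack such a structure with XDR and send it
in a UDP datagram. The tag value and metric fields are made up for
illustration; real structure numbers are assigned in the sFlow specs:

    # Toy example only: XDR-encode a hypothetical "GPU metrics"
    # structure as (tag, length, payload) and send it over UDP.
    import socket
    import xdrlib

    GPU_STRUCT_TAG = 9999              # made-up tag, for illustration only

    payload = xdrlib.Packer()
    payload.pack_uint(1)               # hypothetical field: GPU count
    payload.pack_uint(67)              # hypothetical field: temperature (C)
    body = payload.get_buffer()

    msg = xdrlib.Packer()
    msg.pack_uint(GPU_STRUCT_TAG)      # tag tells the collector how to decode,
    msg.pack_uint(len(body))           # length lets unknown tags be skipped
    msg.pack_fopaque(len(body), body)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg.get_buffer(), ("collector.example.com", 6343))  # 6343 = sFlow port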
Hey,
A meeting may be a good idea. My schedule is mostly open next week. When are
others free? I will brush up on sflow by then.
NVML and the Python metric module are tested at NVIDIA on Windows and Linux,
but not within Cygwin. The process will be easier/faster on the NVML side if
we keep Cygwin out of the loop.
Nodes within our Linux cluster may each have 4, 8 or 16 GPUs. I'm
currently using the NVML Python Nvidia module to gather various metrics
for each GPU for each of the 500 nodes in our cluster. Therefore within
my /var/lib/ganglia/rrds/Dell_group/node1, you would find the following
rrd files ...
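The rrd file list is trimmed above, but the scale is easy to estimate:
one rrd file per metric per host, and per-GPU metrics multiply by GPU
count. A back-of-envelope sketch with a made-up metric list:

    # Rough sizing sketch; metric names are illustrative, not the
    # module's exact names.
    metrics_per_gpu = ["util", "mem_util", "temp", "fan", "power", "mem_used"]
    nodes, gpus_per_node = 500, 16
    per_node = gpus_per_node * len(metrics_per_gpu)
    print("%d rrd files per node, %d cluster-wide" % (per_node, per_node * nodes))
    # -> 96 rrd files per node, 48000 cluster-wide (plus host metrics)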
I take your point about re-using the existing GPU module and gmetric;
unfortunately I don't have experience with Python. My plan is to write
something in C to export the nvml metrics, with various output options.
We will then decide whether to call this new code from existing gmond
3.1 via gmetric, new (if we get it working) gmond 3.4, or one of our
existing third party tools - ITRS Geneous.
As regards your list of metrics, they are pretty definitive, but I ...
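For the gmetric route, the exporter only needs to invoke gmetric once
per metric per sample. A sketch of that glue in Python (a C exporter
would do the same via popen); the metric names are illustrative:

    # Publish one GPU metric per gmetric call, using gmetric's standard
    # --name/--value/--type/--units options.
    import subprocess

    def publish(name, value, mtype="uint32", units=""):
        subprocess.check_call(["gmetric", "--name", name,
                               "--value", str(value),
                               "--type", mtype, "--units", units])

    publish("gpu0_temp", 67, units="C")
    publish("gpu0_util", 85, units="%")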
... all stored on an NFS filesystem /nfs/data/ganglia/rrds.
- A single Apache web server running a single gmetad daemon
which collects data from 4 different clusters (a gmetad.conf sketch
follows this list).
- Installed NVIDIA GPU Python Ganglia module plugin. This
requires the NVIDIA NVML Python binding nvidia-ml-py:
https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib
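Collecting several clusters from one gmetad is a matter of listing one
data_source per cluster. A minimal gmetad.conf sketch; the cluster
names and hosts are made up:

    # /etc/ganglia/gmetad.conf (sketch)
    # data_source "name" [poll interval] host:port ...
    data_source "cluster1" 15 head1.example.com:8649
    data_source "cluster2" 15 head2.example.com:8649
    data_source "cluster3" 15 head3.example.com:8649
    data_source "cluster4" 15 head4.example.com:8649
    rrd_rootdir "/nfs/data/ganglia/rrds"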