Hello,

I'm managing a cluster, and we use Ganglia for monitoring.

As we now have some gpus (CUDA), I have tried to install this software:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

In the web interface, I can see the graphs related to gpu, but there is no
data inside.

I'm using ganglia 3.7.2

I have installed the package nvidia-ml-py

https://pypi.python.org/pypi/nvidia-ml-py/7.352.0
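
To rule out a problem with the bindings themselves, a small check like the
one below could be run on the GPU node. It is only a sketch: it assumes it
is run with the same Python interpreter that gmond's modpython.so uses
(which should be verified), and the function name probe_gpus is made up for
this example. It uses only nvmlInit, nvmlDeviceGetCount,
nvmlDeviceGetHandleByIndex, nvmlDeviceGetName and nvmlShutdown from pynvml.

```python
# Sketch: check that pynvml can see the GPUs from a given Python interpreter.
# Assumption: this runs under the same Python that gmond's modpython.so embeds.


def probe_gpus():
    """Return (ok, lines); ok is False if pynvml or the driver is unusable."""
    try:
        import pynvml
    except Exception as exc:  # missing package, or wrong Python version
        return False, ["cannot import pynvml: %s" % exc]

    try:
        pynvml.nvmlInit()
    except Exception as exc:  # pynvml raises NVMLError subclasses here
        return False, ["nvmlInit failed: %s" % exc]

    lines = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            lines.append("gpu%d: %s" % (index, pynvml.nvmlDeviceGetName(handle)))
    except Exception as exc:
        return False, lines + ["NVML query failed: %s" % exc]
    finally:
        pynvml.nvmlShutdown()
    return True, lines


if __name__ == "__main__":
    ok, lines = probe_gpus()
    print("OK" if ok else "FAILED")
    for line in lines:
        print(line)
```

If this prints FAILED, the empty graphs may come from the node itself rather
than from the gmond/gmetad chain.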

We run gmond on each node (deaf), plus one gmond on the monitor node which
aggregates the data from all the nodes' gmonds; this master gmond is
polled by gmetad.

gmond config on monitor:
[...]
mute = no
deaf = no
allow_extra_data = yes
[...]

gmond config on nodes:
[...]
mute = no
deaf = yes
allow_extra_data = yes
[...]


On the node with the gpu, I have the following files:

/opt/ganglia/ganglia-3.7.2/lib64/ganglia/python_modules/
├── nvidia.py
├── nvidia_smi.py
└── pynvml.py

/opt/ganglia/ganglia-3.7.2/etc/conf.d/
├── modpython.conf
└── nvidia.pyconf

content of modpython.conf:

modules {
  module {
    name = "python_module"
    path = "modpython.so"
    params = "/opt/ganglia/ganglia-3.7.2/lib64/ganglia/python_modules"
  }
}

include ("/opt/ganglia/ganglia-3.7.2/etc/conf.d/*.pyconf")


The file nvidia.pyconf is the unmodified version from the repository.
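
One way to test the module outside gmond would be a small harness like the
following sketch. It relies on the usual gmond Python module interface
(metric_init(params) returning a list of descriptor dicts, each with a
"name" and a "call_back", plus an optional metric_cleanup()). The helper
name collect_once and the script name used below are hypothetical, and
passing an empty params dict is an assumption: the module may expect values
from nvidia.pyconf.

```python
# Sketch: exercise a gmond Python metric module outside gmond.
import importlib
import sys


def collect_once(module, params=None):
    """Call metric_init(), then call each descriptor's call_back once."""
    descriptors = module.metric_init(params or {})
    results = {}
    for desc in descriptors:
        name = desc["name"]
        try:
            results[name] = desc["call_back"](name)
        except Exception as exc:
            # Keep going so one failing metric does not hide the others.
            results[name] = "ERROR: %s" % exc
    if hasattr(module, "metric_cleanup"):
        module.metric_cleanup()
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # usage: python check_module.py <python_modules dir> <module name>
    sys.path.insert(0, sys.argv[1])
    module = importlib.import_module(sys.argv[2])
    for name, value in sorted(collect_once(module).items()):
        print("%s = %r" % (name, value))
```

For this setup the call would be something like:
python check_module.py /opt/ganglia/ganglia-3.7.2/lib64/ganglia/python_modules nvidia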

If I start gmond on this node in the foreground in debug mode, it seems to
be working fine:

[...]
metric 'gpu2_ecc_sb_error' being collected now
metric 'gpu2_ecc_sb_error' has value_threshold 1.000000
sent message 'gpu1_graphics_clock_report' of length 72 with 0 errors
sent message 'gpu3_graphics_clock_report' of length 72 with 0 errors
[...]

On the monitor server, I have these GPU-related files:

/var/www/html/ganglia/graph.d/
├── gpu_common.php
├── gpu_graphics_clock_report.php
├── gpu_mem_clock_report.php
├── gpu_power_usage_report.php
├── gpu_power_violation_report.php
└── gpu_sm_clock_report.php

As it's not working, how can I be sure that the gmond on the GPU node is
actually sending data?
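
One check I can think of is to read the XML dump from the master gmond and
look for the gpu* metrics per host. The sketch below assumes the master
gmond has a tcp_accept_channel on port 8649 (the usual default, to be
checked against gmond.conf) and that the metric names start with "gpu", as
in the debug output above. The function names and the host argument are
made up for this example.

```python
# Sketch: dump what a gmond exposes on its TCP channel and list GPU metrics.
import socket
import sys
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump that gmond writes on its TCP accept channel."""
    sock = socket.create_connection((host, port), timeout)
    try:
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def metrics_by_host(xml_bytes, prefix="gpu"):
    """Map host name -> {metric name: value} for names starting with prefix."""
    found = {}
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            name = metric.get("NAME", "")
            if name.startswith(prefix):
                found.setdefault(host.get("NAME"), {})[name] = metric.get("VAL")
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python list_gpu_metrics.py <master gmond host>
    for host, metrics in sorted(metrics_by_host(fetch_gmond_xml(sys.argv[1])).items()):
        for name, value in sorted(metrics.items()):
            print("%s %s = %s" % (host, name, value))
```

If the GPU node does not show up in the result, the metrics are presumably
not reaching the master gmond; if it does, the problem is more likely
further along, in gmetad or the web frontend.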

Do I have to install anything GPU-related on the master gmond or on gmetad?

Any clue how to troubleshoot this?

Many thanks

-- 
Yann SAGON
Ingénieur système HPC
24 Rue du Général-Dufour
1211 Genève 4 - Suisse
Tél. : +41 (0)22 379 7737
yann.sa...@unige.ch - www.unige.ch
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
