Re: [Ganglia-general] Aggregating all GPU metrics into single graph.

Lee, Wayne Thu, 16 May 2013 09:58:00 -0700

To Ganglia List,

First of all, I've managed to generate aggregate GPU metrics for systems with
multiple GPUs in each system.  As I stated in my previous posting, I took
Bernard Li's original nvidia.py script and customized it.  It's "ugly", but it
works.  :)  I also figured out why I wasn't getting any graphs even though
I could see my modified nvidia.py script was working: I hadn't properly
updated my /etc/ganglia/conf.d/nvidia.pyconf file, which controls the
collection of the metrics.  So I've got this working now.
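
To illustrate the pyconf point: every metric the Python module defines also
needs a matching entry in the .pyconf file, or gmond will not collect it.
Below is a minimal sketch (not my actual file) of a helper that renders such a
file from a list of metric names, so the two stay in sync.  The metric names
and the collect_every/time_threshold values are made-up examples, and
render_pyconf is a hypothetical helper, not part of Ganglia.

```python
def render_pyconf(module_name, metric_names, collect_every=10, time_threshold=50):
    """Return the text of a gmond .pyconf file for one Python module.

    The interval values are illustrative defaults, not recommendations.
    """
    lines = [
        "modules {",
        "  module {",
        '    name = "%s"' % module_name,
        '    language = "python"',
        "  }",
        "}",
        "",
        "collection_group {",
        "  collect_every = %d" % collect_every,
        "  time_threshold = %d" % time_threshold,
    ]
    # One metric block per metric name defined by the module.
    for name in metric_names:
        lines += ["  metric {", '    name = "%s"' % name, "  }"]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical aggregate metric names, for illustration only.
    print(render_pyconf("nvidia", ["gpu_node_mem_total", "gpu_node_util_avg"]))
```

The output would be written to /etc/ganglia/conf.d/nvidia.pyconf and gmond
restarted afterwards.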


I know there might be some interest in how I gathered the metrics, so here is
an outline:


1.       Get nvidia.py.

2.       Customize it.   The tack I took was to use the pynvml module to
query the metrics I wanted.   Since we have a mix of different GPUs within our
organization, I had to put some logic in my customized nvidia.py script to test
for the type of GPU, so that I collect only the metrics each specific type of
GPU supports.   Then I just add up or average the values across all GPUs in a
system (total GPU memory, average GPU utilization, etc.).

3.       Make sure you've updated the /etc/ganglia/conf.d/nvidia.pyconf file
so that gmond collects your new metrics.

4.       I was able to use some of the example .json reports as a starting
point for some of the reports I wanted to display.   In one case, instead of
using a .json report, I had to resort to a .php report for "avg GPU
utilization".
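
The aggregation in step 2 can be sketched roughly as follows.  This is not my
actual script, only a minimal illustration under some assumptions: the metric
names (gpu_node_*) are made up, memory is reported in MiB, and instead of
testing for the GPU type it simply catches the NVML error raised when a GPU
does not report utilization.  The pynvml calls used are nvmlInit,
nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo,
nvmlDeviceGetUtilizationRates and nvmlShutdown.

```python
MIB = 1024.0 * 1024.0


def aggregate_gpu_samples(samples):
    """Combine per-GPU readings into node-level values.

    `samples` is a list of dicts with "mem_total" and "mem_used" in bytes and
    "util" in percent (None when the GPU does not report utilization).
    """
    utils = [s["util"] for s in samples if s["util"] is not None]
    return {
        "gpu_node_count": float(len(samples)),
        "gpu_node_mem_total": sum(s["mem_total"] for s in samples) / MIB,
        "gpu_node_mem_used": sum(s["mem_used"] for s in samples) / MIB,
        # Average only over the GPUs that actually reported a value.
        "gpu_node_util_avg": float(sum(utils)) / len(utils) if utils else 0.0,
    }


def read_gpu_samples():
    """Query every GPU in the node through pynvml.

    pynvml is imported here so the aggregation above can be tried on a
    machine without NVIDIA drivers.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            try:
                util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            except pynvml.NVMLError:
                util = None  # assumption: treat any NVML error as "not supported"
            samples.append(
                {"mem_total": mem.total, "mem_used": mem.used, "util": util}
            )
        return samples
    finally:
        pynvml.nvmlShutdown()


def node_metric_handler(name):
    """gmond callback: return the current value of one aggregate metric."""
    return aggregate_gpu_samples(read_gpu_samples())[name]


def metric_init(params):
    """gmond entry point: describe the aggregate metrics this module offers."""
    units = {
        "gpu_node_count": "GPUs",
        "gpu_node_mem_total": "MiB",
        "gpu_node_mem_used": "MiB",
        "gpu_node_util_avg": "%",
    }
    return [
        {
            "name": name,
            "call_back": node_metric_handler,
            "time_max": 90,
            "value_type": "float",
            "units": unit,
            "slope": "both",
            "format": "%f",
            "description": name,
            "groups": "gpu_node",
        }
        for name, unit in sorted(units.items())
    ]


def metric_cleanup():
    """gmond entry point: nothing to release in this sketch."""
    pass
```

Note that this sketch re-queries every GPU for each metric callback; a real
module would probably cache the readings for a few seconds.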

I would consider posting my modified script, but I feel it is not general
enough to be useful, since it is so customized for the HPC environment I'm
currently using it in.

Thanks to all in the community for any and all help that was provided.

Wayne Lee

From: Lee, Wayne
Sent: Tuesday, May 07, 2013 4:14 PM
To: ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Aggregating all GPU metrics into single graph.

To Ganglia List,

I've managed to take the "nvidia.py" Python script (i.e. module) that was
originally created by Bernard Li and "hack" it so that I could get the script
to provide aggregated GPU metrics for nodes with multiple GPUs in them.

The "hacked" script is ugly, since I really didn't know much about Python, but 
it works when I run the script on several of our multi-GPU nodes within the 
Python interpreter.

I added the "nvidia.py" module to the rest of the Python modules which gmond 
loads on each Ganglia client node.    I'm getting some graphs for some of the 
new metrics I created, but I'm not getting others.   I don't know why.

I've run "gmond -m" on several different Ganglia clients and "gmond" reports 
that it knows about the metrics I've added.    I've also run "gmond -d 2" to 
help debug my problem.  I see data from some of my custom GPU metrics being
sent out, but not other ones.    I'm at a loss to understand why.    My version
of the nvidia.py Python module does group my custom GPU metrics into a group 
called "gpu_node".   I see a few of the graphs for my metrics show up in the 
"gpu_node" group for nodes in the "Host view", but not the rest.

For the metrics I do see, the associated *.rrd files are created in the
/var/lib/ganglia/rrds directory.  However, there is also an *.rrd file that
was created but for which I don't see a graph.
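
One check that can narrow this down is to compare the metric names the module
defines against the *.rrd files that actually exist for a host.  The sketch
below is only an illustration: metrics_without_rrd is a hypothetical helper,
the metric names are made up, and it assumes the usual layout where each
host's files sit in their own directory under /var/lib/ganglia/rrds, named
<metric>.rrd.

```python
import os


def metrics_without_rrd(metric_names, host_rrd_dir):
    """Return the metric names that have no <name>.rrd file in host_rrd_dir."""
    present = set(
        filename[: -len(".rrd")]
        for filename in os.listdir(host_rrd_dir)
        if filename.endswith(".rrd")
    )
    return sorted(name for name in metric_names if name not in present)


if __name__ == "__main__":
    # Hypothetical host directory and metric names; adjust to the real ones.
    host_dir = "/var/lib/ganglia/rrds/mycluster/gpunode01"
    expected = ["gpu_node_mem_total", "gpu_node_mem_used", "gpu_node_util_avg"]
    if os.path.isdir(host_dir):
        for name in metrics_without_rrd(expected, host_dir):
            print("no rrd file for metric: %s" % name)
```

A metric missing from this list but also missing a graph points at the web
front end; a metric with no *.rrd file at all points back at gmond and the
.pyconf collection settings.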

The "gmond" process runs as the user "nobody".   I thought that this might be 
an issue, but I don't think so given that the updated script I wrote is 
generating other GPU metrics, but just not all of them.   The GPU metrics are 
using the pynvml.py modules to query GPU information from all the GPUs.

I am wondering if there is something I can check to help debug this problem?

I should point out that the original script Bernard Li wrote gathered GPU 
metrics for each GPU you would have in a single node.    We have nodes here 
that have anywhere from 4 to 16 GPUs in them each.   So the number of metrics 
generated by Bernard's original module can be quite a lot.  Now that I've 
modified it, there are a lot more metrics per node generated.   Could this be 
an issue?

Kind regards,

-----
Wayne Lee



