To Ganglia List,
First of all, I've managed to generate aggregate GPU metrics for systems with
multiple GPUs in each system. As I stated in my previous posting, I took
Bernard Li's original nvidia.py script and customized it. It's "ugly", but it
works. :) I also figured out why I wasn't getting any graphs even though
I could see my modified nvidia.py script was working: I hadn't properly
updated my /etc/ganglia/conf.d/nvidia.pyconf file, which controls the
collection of the metrics. So I've got this working now.
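For illustration only, a pyconf fragment along the following lines lists
aggregate metrics for collection. This is a sketch, not the actual file: the
metric names, titles, and intervals are hypothetical placeholders, and each
"name" would have to match a metric that the Python module really defines.

```
modules {
  module {
    name = "nvidia"
    language = "python"
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50

  # Hypothetical aggregate metrics; a metric the module defines but that
  # is not listed in a collection_group is not collected.
  metric {
    name = "gpu_node_mem_total"
    title = "Total GPU Memory (node)"
  }
  metric {
    name = "gpu_node_util_avg"
    title = "Average GPU Utilization (node)"
  }
}
```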
I know there might be some interest in how I gathered the metrics, so here
is a summary:
1. Get nvidia.py.
2. Customize it. The tack I took was to use the pynvml module to
query the metrics I wanted. Since we have a mix of different GPUs within our
organization, I had to put some logic in my customized nvidia.py script to test
for the type of GPU, so that I could collect the appropriate metrics
for each specific type of GPU. Then I would either add up (e.g. total GPU
memory) or average (e.g. GPU utilization) the values for all GPUs in a system.
3. Make sure you've updated the /etc/ganglia/conf.d/nvidia.pyconf file
so that your new metrics are actually collected.
4. I was able to use some of the example .json reports to get some of the
reports I wanted to display. In one case, instead of using a .json report, I
had to resort to using a .php report for "avg GPU utilization".
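The aggregation described in step 2 could be sketched roughly as below. This
is not the modified script itself: the function names and dictionary keys are
made up for the example, and it uses a try/except around the utilization query
instead of testing for the GPU type. It assumes the pynvml package is
installed on the node.

```python
# Sketch only: aggregate() and read_gpus() are illustrative names,
# not functions from the actual customized nvidia.py.

def aggregate(per_gpu):
    """Combine per-GPU readings into node-level totals and an average.

    per_gpu is a list of dicts with keys 'mem_total', 'mem_used' (bytes)
    and 'util' (percent).
    """
    count = len(per_gpu)
    if count == 0:
        return {'gpu_count': 0, 'mem_total': 0, 'mem_used': 0, 'util_avg': 0.0}
    return {
        'gpu_count': count,
        'mem_total': sum(g['mem_total'] for g in per_gpu),
        'mem_used': sum(g['mem_used'] for g in per_gpu),
        'util_avg': sum(g['util'] for g in per_gpu) / float(count),
    }


def read_gpus():
    """Query every GPU through pynvml; returns a list for aggregate()."""
    import pynvml  # imported here so aggregate() works without the library

    pynvml.nvmlInit()
    try:
        readings = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            try:
                util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            except pynvml.NVMLError:
                # Assumption for this sketch: a GPU that cannot report
                # utilization is counted as 0, which lowers the average.
                util = 0
            readings.append({'mem_total': mem.total,
                             'mem_used': mem.used,
                             'util': util})
        return readings
    finally:
        pynvml.nvmlShutdown()
```

On a GPU node, aggregate(read_gpus()) would give one set of node-level values
that a metric callback could return.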
I would consider posting my modified script, but I feel it is not general
enough to be useful, since it is so customized for the HPC environment I'm
currently using it in.
Thanks to all in the community for any and all help that was provided.
Wayne Lee
From: Lee, Wayne
Sent: Tuesday, May 07, 2013 4:14 PM
To: ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Aggregating all GPU metrics into single graph.
To Ganglia List,
I've managed to take the "nvidia.py" Python script (i.e. module) that was
originally created by Bernard Li and "hack" it so that I could get the script
to provide aggregated GPU metrics for nodes with multiple GPUs in them.
The "hacked" script is ugly, since I really didn't know much about Python, but
it works when I run the script on several of our multi-GPU nodes within the
Python interpreter.
I added the "nvidia.py" module to the rest of the Python modules which gmond
loads on each Ganglia client node. I'm getting some graphs for some of the
new metrics I created, but I'm not getting others. I don't know why.
I've run "gmond -m" on several different Ganglia clients and "gmond" reports
that it knows about the metrics I've added. I've also run "gmond -d 2" to
help debug my problem. I see data from some of my custom GPU metrics being
sent out, but not others. I'm at a loss to understand why. My version
of the nvidia.py Python module does group my custom GPU metrics into a group
called "gpu_node". I see a few of the graphs for my metrics show up in the
"gpu_node" group for nodes in the "Host view", but not the rest.
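For reference, a gmond Python module exposes its metrics through
metric_init(), which returns a list of descriptor dictionaries; the 'groups'
entry is what places a metric under a group such as "gpu_node". A minimal
sketch of that interface follows. The metric names, units, and placeholder
values are hypothetical and are not taken from the modified nvidia.py.

```python
# Sketch of the gmond Python-module interface. Metric names and the
# fixed placeholder values below are hypothetical.

_VALUES = {'gpu_node_mem_total': 0, 'gpu_node_util_avg': 0.0}


def get_value(name):
    """Callback that gmond invokes with the metric name."""
    return _VALUES[name]


def _descriptor(name, value_type, fmt, units, description):
    """Build one metric descriptor in the 'gpu_node' group."""
    return {
        'name': name,
        'call_back': get_value,
        'time_max': 90,
        'value_type': value_type,  # should agree with 'format' below
        'format': fmt,
        'units': units,
        'slope': 'both',
        'description': description,
        'groups': 'gpu_node',
    }


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    return [
        _descriptor('gpu_node_mem_total', 'uint', '%u', 'MB',
                    'Total GPU memory on node'),
        _descriptor('gpu_node_util_avg', 'float', '%.1f', '%',
                    'Average GPU utilization on node'),
    ]


def metric_cleanup():
    """Called once when gmond shuts down."""
    pass
```

Calling metric_init({}) in the Python interpreter and printing the 'name' of
each descriptor gives a list that can be compared against what "gmond -m"
reports.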
For the metrics I do see, the associated *.rrd files are created in the
/var/lib/ganglia/rrds directory. In some cases, though, an *.rrd file is
created but I don't see a graph.
The "gmond" process runs as the user "nobody". I thought that this might be
an issue, but I don't think so given that the updated script I wrote is
generating other GPU metrics, but just not all of them. The GPU metrics are
using the pynvml.py modules to query GPU information from all the GPUs.
Is there something I can check to help debug this problem?
I should point out that the original script Bernard Li wrote gathers GPU
metrics for each individual GPU in a node. We have nodes here that each
have anywhere from 4 to 16 GPUs. So the number of metrics generated by
Bernard's original module can be quite large. Now that I've modified it,
even more metrics are generated per node. Could this be an issue?
Kind regards,
-----
Wayne Lee