[Ganglia-general] Aggregating all GPU metrics into single graph.

Lee, Wayne
Sun, 21 Apr 2013 12:12:58 -0700

To list,

Let me describe the setup I administer before asking my questions.


I currently run Ganglia 3.1.7 with Ganglia-web-3.5.7, monitoring a CPU/GPGPU
Linux cluster of roughly 500 nodes. Our Ganglia setup consists of one "grid"
(i.e. one gmetad process) that represents all the nodes in the cluster. Within
that grid, the nodes are grouped into "clusters", one per hardware platform:
the first is the "Dell group", the second the "HP group", and the third the
"Appro group". Each node may have 4, 8 or 16 GPUs. I'm currently using the
NVML Python Nvidia module to gather various metrics for each GPU on each of
the 500 nodes. As a result, /var/lib/ganglia/rrds/Dell_group/node1 contains
the following rrd files, which represent the metrics for each GPU on node1.

gpu0_graphics_speed.rrd
gpu0_mem_speed.rrd
gpu0_mem_total.rrd
gpu0_mem_used.rrd
gpu0_mem_util.rrd
gpu0_sm_speed.rrd
gpu0_temp.rrd
gpu0_util.rrd
gpu1_graphics_speed.rrd
gpu1_mem_speed.rrd
gpu1_mem_total.rrd
gpu1_mem_used.rrd
gpu1_mem_util.rrd
gpu1_sm_speed.rrd
gpu1_temp.rrd
gpu1_util.rrd
gpu2_graphics_speed.rrd
gpu2_mem_speed.rrd
gpu2_mem_total.rrd
gpu2_mem_used.rrd
gpu2_mem_util.rrd
gpu2_sm_speed.rrd
gpu2_temp.rrd
gpu2_util.rrd
gpu3_graphics_speed.rrd
gpu3_mem_speed.rrd
gpu3_mem_total.rrd
gpu3_mem_used.rrd
gpu3_mem_util.rrd
gpu3_sm_speed.rrd
gpu3_temp.rrd
gpu3_util.rrd
gpu_num.rrd
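
To make the naming scheme above concrete, here is a minimal Python sketch that
groups such file names by metric. The function name group_by_metric and the
regular expression are illustrative assumptions, not part of Ganglia or the
NVML module; the only thing taken from the listing is the gpu<N>_<metric>.rrd
pattern.

```python
import re
from collections import defaultdict

# Matches per-GPU files such as gpu0_util.rrd or gpu12_mem_used.rrd.
# gpu_num.rrd has no GPU index and is deliberately not matched.
GPU_RRD = re.compile(r"^gpu(\d+)_(\w+)\.rrd$")


def group_by_metric(filenames):
    """Map each metric name to its RRD files, ordered by GPU index.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = GPU_RRD.match(name)
        if match:
            gpu_index = int(match.group(1))
            metric = match.group(2)
            groups[metric].append((gpu_index, name))
    # Sort numerically so that gpu10 comes after gpu2, not before it.
    return {
        metric: [name for _, name in sorted(files)]
        for metric, files in groups.items()
    }
```

For the node1 listing, group_by_metric(...)["util"] would hold the four
gpu#_util.rrd files, which are the ones relevant to a utilization graph.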

Questions/Comments:
---------------------------

-       What I would like to do, for example, is take the GPU utilization
(i.e. gpu#_util.rrd) of every GPU on every node within our Linux cluster and
display it in a single graph called "Global Grid GPU".  Eventually I would
like to extend this to GPU memory for all GPUs combined and possibly other GPU
metrics.  What is the best way for me to achieve this?  Since most of our
computational work is done on our GPUs, we would like to have a single graph
of GPU utilization to present to our executive management, so that they can
see how heavily our GPUs are being utilized.

-       I've been reading through the Ganglia book and whatever documentation
I can find, and it looks like I would first have to create a .php or .json
script that generates a report.  That script would have to be placed in the
/var/www/html/ganglia-web/graph.d directory.

-       Would I need to merge all of the gpu#_util.rrd files into one rrd file
(called gpu_util.rrd, for example) and then create a .php script that would
extract the necessary information from the merged gpu_util.rrd file?

-       I'm not a .php/.json expert, nor am I an expert with RRDtool.
However, I'm willing to do some "hacking" to make it work if I could get some
idea of the best way to proceed.
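
To illustrate the kind of aggregation described in the questions above, here
is a rough Python sketch of one possible approach that does not merge any rrd
files: it collects every gpu#_util.rrd under the rrds directory and builds
DEF/CDEF arguments for a single "rrdtool graph" call that averages them. This
is a sketch under stated assumptions, not a tested Ganglia recipe. The
function names are hypothetical, the data source name "sum" is an assumption
about how gmetad writes its rrd files, and skipping "__SummaryInfo__"
directories assumes gmetad keeps its own summary rrd files there.

```python
import os
import re

# Only plain utilization files; gpu0_mem_util.rrd must not match.
UTIL_RRD = re.compile(r"^gpu\d+_util\.rrd$")


def find_util_rrds(rrd_root):
    """Return sorted paths of all gpu#_util.rrd files below rrd_root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(rrd_root):
        # Assumption: summary directories would double-count, so skip them.
        dirnames[:] = [d for d in dirnames if d != "__SummaryInfo__"]
        for name in filenames:
            if UTIL_RRD.match(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def build_average_args(rrd_paths, ds_name="sum"):
    """Build rrdtool graph arguments that average all given RRD files.

    ds_name="sum" is an assumption about the data source name in the files.
    Paths containing ':' would need escaping, which is not handled here.
    """
    if not rrd_paths:
        raise ValueError("no RRD files given")
    args = []
    for i, path in enumerate(rrd_paths):
        # One DEF per file: DEF:<vname>=<rrdfile>:<ds-name>:<CF>
        args.append(f"DEF:g{i}={path}:{ds_name}:AVERAGE")
    # RPN expression: g0,g1,+,g2,+,...,<count>,/
    rpn = ["g0"]
    for i in range(1, len(rrd_paths)):
        rpn += [f"g{i}", "+"]
    rpn += [str(len(rrd_paths)), "/"]
    args.append("CDEF:avg_util=" + ",".join(rpn))
    args.append("LINE2:avg_util#0000FF:Average GPU utilization")
    return args
```

The resulting list could be appended to an "rrdtool graph <output.png>"
command line. Two caveats with this sketch: with several thousand GPUs the
command line becomes very long, and a single unknown sample in any file makes
the averaged value unknown for that time step.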

Thanks in advance for any comments/thoughts.

Regards,

Wayne Lee

