[Ganglia-general] Gmond Python module for monitoring NVIDIA GPUs
Dear all: Just a quick note letting you know that we now have a Python module for monitoring NVIDIA GPUs, using the newly released Python bindings for NVML: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia If you are running a cluster with NVIDIA GPUs, please download the module and give it a try. The module itself is pretty much feature complete, but the GUI/reports still need some work. It would be cool if we could extend it to work with the new gweb 2.0 as well. Please feel free to fork the repo and submit pull requests. Special thanks to the team at NVIDIA for their help in implementing the plugin, and to Jeremy Enos at NCSA for providing access to an NVIDIA GPU cluster. Cheers, Bernard

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
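For anyone curious what a gmond Python metric module looks like, here is a minimal sketch in the shape gmond expects: metric_init() returning a list of metric descriptors, a callback per metric, and metric_cleanup(). The NVML call is an assumption and is commented out, with a placeholder value so the sketch stands alone; the real module reads values through the pynvml bindings.

```python
# Minimal sketch of a gmond Python metric module (not the shipped nvidia.py).

def gpu_util_handler(name):
    """Called by gmond with the metric name; returns the current value."""
    # Real module would use the NVML bindings (assumption, needs nvidia-ml-py):
    #   handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    #   return int(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    return 0  # placeholder value so this sketch runs without a GPU

def metric_init(params):
    """Called once by gmond at startup; returns metric descriptors."""
    return [{
        'name': 'gpu0_util',
        'call_back': gpu_util_handler,
        'time_max': 90,
        'value_type': 'uint',
        'units': '%',
        'slope': 'both',
        'format': '%u',
        'description': 'GPU 0 utilization',
        'groups': 'gpu',
    }]

def metric_cleanup():
    """Called once when gmond shuts down."""
    pass
```

The descriptor dictionary is what conf.d/nvidia.pyconf-style collection groups refer to by metric name.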
[Ganglia-general] how to integrate nvidia gpu monitoring
Dear All, I followed the procedure below but failed to get the GPU info in Ganglia.

NVIDIA GPU monitoring plugin for gmond == Installation instructions:
* First install the Python bindings for the NVIDIA Management Library: $ cd nvidia-ml-py-* $ sudo python setup.py install For the latest bindings see: http://pypi.python.org/pypi/nvidia-ml-py/ You can do a site install or place it in {libdir}/ganglia/python_modules
* Copy python_modules/nvidia.py to {libdir}/ganglia/python_modules
* Copy conf.d/nvidia.pyconf to /etc/ganglia/conf.d
* Copy graph.d/* to {ganglia_webroot}/graph.d/
* A demo of what the GPU graphs look like is available here: http://ganglia.ddbj.nig.ac.jp/?c=research+month+gpu+queue=t135i=load_one=hour=by+name=4=2
By default all metrics that the management library can detect for your GPU are collected. For more information on which metrics are supported on which models, please refer to the NVML documentation.

After following the above procedure and restarting the respective gmond and gmetad services, I still could not get the GPU metrics in Ganglia. Thanks & Regards, Hridyesh Kumar System Engineer
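When the module installs cleanly but no metrics appear, a first check is whether the NVML bindings are importable by the interpreter gmond uses. A small sketch (assumes nvidia-ml-py has been installed as in the instructions above):

```python
# Sanity check: can this interpreter import pynvml and see any GPUs?
# Run it with the same Python that gmond's mod_python uses.

def nvml_device_count():
    """Return the GPU count NVML reports, or None if pynvml is missing."""
    try:
        import pynvml
    except ImportError:
        return None  # bindings not installed for this interpreter
    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()

print("NVML device count:", nvml_device_count())
```

If this prints None, the bindings were installed for a different Python than the one gmond embeds, which matches the symptom described in this thread.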
[Ganglia-general] Gmond Python Module for Monitoring NVIDIA GPU
Before trying the instructions posted in http://developer.nvidia.com/ganglia-monitoring-system on one of our Rocks 5.4.2 clusters that has 2 GPU cards in every compute node, I tried them out on a standalone Linux workstation running RHEL 6.2 (no Rocks). Notes from that attempt are posted here: http://sgowtham.net/blog/2012/02/11/ganglia-gmond-python-module-for-monitoring-nvidia-gpu/ Now that I know it works as explained, I'd like to try this out on the aforementioned Rocks 5.4.2 cluster with GPUs. The Python bindings for the NVIDIA Management Library http://pypi.python.org/pypi/nvidia-ml-py/ require Python newer than 2.4 - following Phil's instructions in a recent email, I got Python 2.7 and 3.x to install, and used that to install these Python bindings for NVML. I then followed the instructions on the 'Ganglia/gmond python modules' page https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia - 'nvidia_smi.py' and 'pynvml.py' were copied to /opt/ganglia/lib64/ganglia/python_modules/ and so on. For some reason, the Ganglia metrics do not include any GPU related information from the compute nodes. If any of you have tried this on your cluster and got it to work, I'd greatly appreciate some direction. Thanks for your time and help. Best, g -- Gowtham Information Technology Services Michigan Technological University (906) 487/3593 http://www.it.mtu.edu/
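One way to debug a module that gmond silently skips is to load the module file directly and call metric_init() yourself, so that import errors or NVML failures print a traceback instead of vanishing into gmond's logs. A sketch (the path in the comment is the one used in this thread; importlib is the modern loader - on the Python 2.x systems discussed here the older 'imp' module plays the same role):

```python
# Load a gmond python module file by path and exercise its metric_init(),
# surfacing any error gmond would otherwise swallow.
import importlib.util

def try_load(path):
    spec = importlib.util.spec_from_file_location("nvidia", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)   # raises loudly on any import error
    return mod.metric_init({})     # descriptor list, or an NVML traceback

# Example (assumption - adjust for your install):
# print(try_load('/opt/ganglia/lib64/ganglia/python_modules/nvidia.py'))
```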
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Robert: When you said you tested the Python metric modules, did you just test the Python scripts under Windows, or did you somehow get gmond compiled natively under Windows with Python support? Thanks, Bernard

On Thursday, July 12, 2012, Robert Alexander wrote: Hey, A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sFlow by then. NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop. -Robert

-Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Thursday, July 12, 2012 10:49 AM To: Nigel LEACH Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Technically you only need a 3.1 gmond to have support for the Python metric module, but I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we set up a meeting to discuss this? Thanks, Bernard

On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR); the problem is with the 3.4 spin.

-Original Message- From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] Sent: 12 July 2012 11:54 To: Nigel LEACH Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it, I can upload the gmond and 3rd party sources + compilation script somewhere. Also, I have gmetad 3.0.7 compiled for Windows. 
In addition, I developed (just for fun) my own implementation of gmetad 3.1.2 using .NET and C#. P.S. I do not know whether it is possible to use these gmond versions to collect statistics from GPUs. -- Best regards, Ivan. 2012/7/12 Nigel LEACH nigel.le...@uk.bnpparibas.com: Thanks for the updates, Peter and Bernard. I have been unable to get gmond 3.4 working under Cygwin; my latest errors are in parsing gm_protocol_xdr.c. I don't know whether we should follow this up - it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics. I take your point about re-using the existing GPU module and gmetric; unfortunately I don't have experience with Python. My plan is to write something in C to export the NVML metrics, with various output options. We will then decide whether to call this new code from the existing gmond 3.1 via gmetric, the new (if we get it working) gmond 3.4, or one of our existing third-party tools - ITRS Geneos. As regards your list of metrics, it is pretty definitive, but I will probably also export: *total ECC errors - nvmlDeviceGetTotalEccErrors *individual ECC errors - nvmlDeviceGetDetailedEccErrors *active compute processes - nvmlDeviceGetComputeRunningProcesses Regards Nigel -Original Message- From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] Sent: 10 July 2012 20:06 To: Nigel LEACH Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. 
You could combine code from the Python GPU module and the gmetric implementations to produce a self-contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, adding them to the sFlow agent is pretty trivial. Currently the Ganglia Python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics. * gpu_num * gpu_driver * gpu_type * gpu_uuid * gpu_pci_id * gpu_mem_total * gpu_graphics_speed
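The self-contained pynvml-plus-gmetric script suggested above could be sketched like this. The helper names are hypothetical; it assumes the gmetric CLI is on PATH, and the NVML polling part (which needs nvidia-ml-py) is left commented out so the sketch runs anywhere:

```python
# Sketch: push one metric sample per gmetric invocation.
import subprocess

def build_cmd(name, value, mtype="uint16", units=""):
    """Assemble a gmetric command line for one metric sample."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", mtype, "--units", units]

def send(name, value, mtype="uint16", units=""):
    """Invoke gmetric (assumes it is installed and configured)."""
    subprocess.check_call(build_cmd(name, value, mtype, units))

# Polling loop sketch (assumption - needs nvidia-ml-py):
#   pynvml.nvmlInit()
#   h = pynvml.nvmlDeviceGetHandleByIndex(0)
#   send("gpu0_util", pynvml.nvmlDeviceGetUtilizationRates(h).gpu, units="%")
```

One gmetric process per sample is heavyweight compared to a resident gmond module, which is part of why the thread leans toward native sFlow export longer term.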
Re: [Ganglia-general] how to integrate nvidia gpu monitoring
Dear All, it was not properly copied. I recopied it and the problem is solved. Thanks & Regards, Hridyesh Kumar System Engineer

From: Hridyesh Kumar <hridyesh.ku...@locuz.com> Sent: Tuesday, October 20, 2015 1:49 PM To: ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] how to integrate nvidia gpu monitoring Dear All, I followed the procedure below but failed to get the GPU info in Ganglia. [...]
[Ganglia-general] Gmond Python Module for Monitoring NVIDIA GPU
I'm trying to implement the instructions given here http://developer.nvidia.com/ganglia-monitoring-system on one of our Rocks 5.4.2 clusters that has 2 GPU cards in every compute node.

Part #1: Python bindings for the NVML http://pypi.python.org/pypi/nvidia-ml-py/ This requires Python newer than 2.4 - following Phil's instructions in a recent email, I got Python 2.7 and 3.x to install, and used that to install these Python bindings for NVML. Following are the commands I used on the front end as well as the compute nodes:

cd /share/apps/tmp/
wget http://pypi.python.org/packages/source/n/nvidia-ml-py/nvidia-ml-py-2.285.01.tar.gz
cd /tmp/
tar -zxvf /share/apps/tmp/nvidia-ml-py-2.285.01.tar.gz
cd nvidia-ml-py-2.285.01
/opt/python/bin/python2.7 setup.py install

The process completes with no errors, with this output:

running install
running build
running build_py
running install_lib
running install_egg_info
Writing /opt/python/lib/python2.7/site-packages/nvidia_ml_py-2.285.01-py2.7.egg-info

Part #2: Ganglia/gmond python modules web patch I downloaded ganglia-gmond_python_modules-3dfa553.tar.gz from https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia to /share/apps/tmp/ and the commands run afterwards on the front end are as follows:

cd /tmp/
cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modules/
cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/
tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
cd ganglia-gmond_python_modules-3dfa553
cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/
cp graph.d/*.php /var/www/html/ganglia/graph.d/
cd /var/ww/html/ganglia/
patch -p0 /tmp/ganglia-gmond_python_modules-3dfa553/gpu/nvidia/ganglia_web.patch
/etc/init.d/gmetad restart
/etc/init.d/gmond restart

Then on the compute node, I did the following:

cd /tmp/
cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modu$
cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/
tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
cd ganglia-gmond_python_modules-3dfa553
cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/
/etc/init.d/gmond restart

When I point the browser to the cluster's Ganglia page and click on 'compute-0-0', GPU metrics do not show up. What am I doing wrong? Did I miss something simple / important? Does this have anything to do with the fact that most of the Rocks utilities are built with Python 2.4 while this new fancy thing is compiled with Python 2.7? If any of you have tried this on your cluster and got it to work, I'd greatly appreciate some direction. Thanks for your time and help. Best, g -- Gowtham Information Technology Services Michigan Technological University (906) 487/3593 http://www.it.mtu.edu/
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Robert, sFlow is a very simple protocol - an sFlow agent periodically sends XDR-encoded structures over UDP. Each structure has a tag and a length, making the protocol extensible. In the short term, it would make sense to define an sFlow structure to carry the current NVML metrics and tag it using NVIDIA's IANA-assigned vendor number (5703). Something along the lines of:

/* NVML statistics */
/* opaque = counter_data; enterprise = 5703, format = 1 */
struct nvml_gpu_counters {
  unsigned int device_count;
  unsigned int mem_total;
  unsigned int mem_util;
  ...
}

Additional examples are in the sFlow Host Structures specification (http://www.sflow.org/sflow_host.txt); these are the structures currently being exported by the Host sFlow agent. Extending the Windows Host sFlow agent to export these metrics would involve adding a routine to populate and serialize this structure - pretty straightforward - if you look at the Host sFlow agent source code you will see examples of how the existing structures are handled. For Ganglia to support the new counters, we would need to add a decoder to gmond for the new structure - also straightforward. Are per-device metrics important, or can we roll up the metrics across all the GPUs on a server? With sFlow we generally roll up metrics for each node where possible - the goal is to provide enough detail so that the operations team can tell whether a node is healthy or not, but not so much as to overwhelm the monitoring system and limit scalability. Once a problem is detected, detailed troubleshooting and diagnostics can be performed using point tools on the host. The metrics currently exposed by the NVML API could be improved - everything appears to be a 1-second gauge. A more robust model for metrics is to maintain monotonic counters so that they can be polled at different frequencies and still produce meaningful results. Counters are also more robust when sending metrics over an unreliable transport like UDP. 
The receiver calculates the deltas and can easily compensate for lost packets. Longer term, it would be useful to have a discussion to see which metrics best characterize operational performance and are feasible to implement. Counters such as number of threads started, number of busy ticks, number of idle ticks, etc. are the type of measurements you want for calculating utilizations. Some kind of load average based on the thread run queue would also be interesting. My calendar is pretty open next week - I am based in San Francisco, so 8am-5pm PST works best. Peter

On Thu, Jul 12, 2012 at 11:58 AM, Robert Alexander ralexan...@nvidia.com wrote: Hey, A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sFlow by then. NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop. -Robert -Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Thursday, July 12, 2012 10:49 AM To: Nigel LEACH Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Technically you only need a 3.1 gmond to have support for the Python metric module, but I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we set up a meeting to discuss this? Thanks, Bernard On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR); the problem is with the 3.4 spin. 
-Original Message- From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] Sent: 12 July 2012 11:54 To: Nigel LEACH Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it, I can upload the gmond and 3rd party sources + compilation script somewhere. Also, I have gmetad 3.0.7 compiled for Windows. In addition, I developed (just for fun) my own implementation of gmetad 3.1.2 using .NET and C#. P.S. I do not know whether it is possible to use these gmond versions to collect statistics from GPUs. -- Best regards, Ivan. 2012/7/12 Nigel LEACH nigel.le...@uk.bnpparibas.com: Thanks for the updates, Peter and Bernard. I have been unable to get gmond 3.4 working under Cygwin; my latest errors are in parsing gm_protocol_xdr.c. I don't know whether we should follow this up - it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics. I take your point about re-using the existing GPU module and gmetric
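The counter-delta calculation described above - the receiver keeps the previous (timestamp, counter) sample and turns a monotonic counter into a rate, tolerating lost datagrams (a longer interval) and counter wrap - can be sketched as:

```python
# Rate from two samples of a monotonic counter, as an sFlow receiver
# would compute it. A lost packet just widens the interval; a counter
# wrap shows up as a negative delta and is corrected modulo the width.

def counter_rate(prev, curr, wrap=2**32):
    """prev/curr are (timestamp_seconds, counter_value) tuples."""
    (t0, c0), (t1, c1) = prev, curr
    if t1 <= t0:
        return None              # duplicate or reordered datagram
    delta = c1 - c0
    if delta < 0:                # 32-bit counter wrapped around
        delta += wrap
    return delta / float(t1 - t0)
```

This is why counters survive an unreliable transport better than 1-second gauges: a dropped sample costs resolution, not correctness.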
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hey, A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sFlow by then. NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop. -Robert -Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Thursday, July 12, 2012 10:49 AM To: Nigel LEACH Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Technically you only need a 3.1 gmond to have support for the Python metric module, but I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we set up a meeting to discuss this? Thanks, Bernard On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR); the problem is with the 3.4 spin. -Original Message- From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] Sent: 12 July 2012 11:54 To: Nigel LEACH Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it, I can upload the gmond and 3rd party sources + compilation script somewhere. Also, I have gmetad 3.0.7 compiled for Windows. In addition, I developed (just for fun) my own implementation of gmetad 3.1.2 using .NET and C#. P.S. I do not know whether it is possible to use these gmond versions to collect statistics from GPUs. -- Best regards, Ivan. 2012/7/12 Nigel LEACH nigel.le...@uk.bnpparibas.com: Thanks for the updates, Peter and Bernard. 
I have been unable to get gmond 3.4 working under Cygwin; my latest errors are in parsing gm_protocol_xdr.c. I don't know whether we should follow this up - it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics. I take your point about re-using the existing GPU module and gmetric; unfortunately I don't have experience with Python. My plan is to write something in C to export the NVML metrics, with various output options. We will then decide whether to call this new code from the existing gmond 3.1 via gmetric, the new (if we get it working) gmond 3.4, or one of our existing third-party tools - ITRS Geneos. As regards your list of metrics, it is pretty definitive, but I will probably also export: *total ECC errors - nvmlDeviceGetTotalEccErrors *individual ECC errors - nvmlDeviceGetDetailedEccErrors *active compute processes - nvmlDeviceGetComputeRunningProcesses Regards Nigel -Original Message- From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] Sent: 10 July 2012 20:06 To: Nigel LEACH Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. You could combine code from the Python GPU module and the gmetric implementations to produce a self-contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. 
Once the set of metrics is agreed on, adding them to the sFlow agent is pretty trivial. Currently the Ganglia Python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics. * gpu_num * gpu_driver * gpu_type * gpu_uuid * gpu_pci_id * gpu_mem_total * gpu_graphics_speed * gpu_sm_speed * gpu_mem_speed * gpu_max_graphics_speed * gpu_max_sm_speed * gpu_max_mem_speed * gpu_temp * gpu_util * gpu_mem_util * gpu_mem_used * gpu_fan * gpu_power_usage * gpu_perf_state * gpu_ecc_mode As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic, since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other
[Ganglia-general] Aggregating all GPU metrics into a single graph
To the list: Let me describe the setup I administer before asking my questions. Currently I have Ganglia 3.1.7 running with Ganglia-web-3.5.7, which is monitoring a roughly 500-node CPU/GPGPU Linux cluster. Our Ganglia setup consists of one grid (i.e. one gmetad process) which represents all the nodes within our Linux cluster. Within our defined grid view, the nodes are grouped into clusters. The cluster views are the different hardware platforms we have: one cluster would be the Dell group, the second the HP group, and the third the Appro group. Each node within our Linux cluster may have 4, 8 or 16 GPUs. I'm currently using the NVML Python nvidia module to gather various metrics for each GPU on each of the 500 nodes in our cluster. Therefore within my /var/lib/ganglia/rrds/Dell_group/node1, you would find the following rrd files, which represent the metrics for each GPU on node1:

gpu0_graphics_speed.rrd gpu0_mem_speed.rrd gpu0_mem_total.rrd gpu0_mem_used.rrd gpu0_mem_util.rrd gpu0_sm_speed.rrd gpu0_temp.rrd gpu0_util.rrd
gpu1_graphics_speed.rrd gpu1_mem_speed.rrd gpu1_mem_total.rrd gpu1_mem_used.rrd gpu1_mem_util.rrd gpu1_sm_speed.rrd gpu1_temp.rrd gpu1_util.rrd
gpu2_graphics_speed.rrd gpu2_mem_speed.rrd gpu2_mem_total.rrd gpu2_mem_used.rrd gpu2_mem_util.rrd gpu2_sm_speed.rrd gpu2_temp.rrd gpu2_util.rrd
gpu3_graphics_speed.rrd gpu3_mem_speed.rrd gpu3_mem_total.rrd gpu3_mem_used.rrd gpu3_mem_util.rrd gpu3_sm_speed.rrd gpu3_temp.rrd gpu3_util.rrd
gpu_num.rrd

Questions/Comments:
- What I would like to do, for example, is take the total GPU utilization (i.e. gpu#_util.rrd) for each and every GPU on every node within our Linux cluster and display it in a graph called Global Grid GPU. Eventually I would like to extend this to GPU memory for all GPUs combined, and possibly other GPU metrics. What is the best way for me to achieve this? 
Since most of our computational work is done on our GPUs, we would like to have a single graph showing GPU utilization to present to our executive management, so they can see how much our GPUs are being utilized.
- I've been attempting to read through the Ganglia book and whatever documentation I can find, and it looks like I would have to create a .php or .json script which would generate a report to begin with. That script would have to be placed in the /var/www/html/ganglia-web/graph.d directory.
- Would I need to merge all of the gpu#_util.rrd files into one rrd file called gpu_util.rrd, for example, and then create a .php script that would extract the necessary information from the merged gpu_util.rrd file?
- I'm not a .php/.json expert, nor am I an expert with RRDtool. However, I'm willing to do some hacking to make it work if I could get some idea of the best way to proceed. Thanks in advance for any comments/thoughts. Regards, Wayne Lee
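Rather than merging RRD files, one approach is to aggregate at report time: fetch each gpu<N>_util series and sum them per timestep. A sketch of the summing step - sum_series() is pure Python so it can be checked anywhere, while the rrdtool.fetch() call that would feed it (python-rrdtool bindings, an assumption) is left commented out:

```python
# Element-wise sum of per-GPU utilization series; None marks a missing
# sample (RRD "unknown"), which is skipped rather than treated as zero.

def sum_series(series_list):
    totals = []
    for samples in zip(*series_list):
        known = [s for s in samples if s is not None]
        totals.append(sum(known) if known else None)
    return totals

# Feeding it from the per-node RRDs (assumption - python-rrdtool):
#   import rrdtool
#   (start, end, step), names, rows = rrdtool.fetch(
#       "/var/lib/ganglia/rrds/Dell_group/node1/gpu0_util.rrd",
#       "AVERAGE", "-s", "-1h")
```

A graph.d report script would do this across all nodes and hand the totals to the graphing layer; the same logic works from PHP via rrd_fetch().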
Re: [Ganglia-general] Gmond Compilation on Cygwin
Thanks for the updates, Peter and Bernard. I have been unable to get gmond 3.4 working under Cygwin; my latest errors are in parsing gm_protocol_xdr.c. I don't know whether we should follow this up - it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics. I take your point about re-using the existing GPU module and gmetric; unfortunately I don't have experience with Python. My plan is to write something in C to export the NVML metrics, with various output options. We will then decide whether to call this new code from the existing gmond 3.1 via gmetric, the new (if we get it working) gmond 3.4, or one of our existing third-party tools - ITRS Geneos. As regards your list of metrics, it is pretty definitive, but I will probably also export: *total ECC errors - nvmlDeviceGetTotalEccErrors *individual ECC errors - nvmlDeviceGetDetailedEccErrors *active compute processes - nvmlDeviceGetComputeRunningProcesses Regards Nigel -Original Message- From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] Sent: 10 July 2012 20:06 To: Nigel LEACH Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. You could combine code from the Python GPU module and the gmetric implementations to produce a self-contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. 
Once the set of metrics is agreed on, adding them to the sFlow agent is pretty trivial. Currently the Ganglia Python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics. * gpu_num * gpu_driver * gpu_type * gpu_uuid * gpu_pci_id * gpu_mem_total * gpu_graphics_speed * gpu_sm_speed * gpu_mem_speed * gpu_max_graphics_speed * gpu_max_sm_speed * gpu_max_mem_speed * gpu_temp * gpu_util * gpu_mem_util * gpu_mem_used * gpu_fan * gpu_power_usage * gpu_perf_state * gpu_ecc_mode As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic, since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup then this involves restructuring your configuration. You still need at least one gmond instance, but it acts as an sFlow aggregator and is mute: http://blog.sflow.com/2011/07/ganglia-32-released.html Peter On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Hello Bernard, I was coming to that conclusion; I've been trying to compile on various combinations of Cygwin, Windows and hardware this afternoon, but without success yet. I've still got a few more tests to do, though. The GPU plugin is my only reason for upgrading from our current 3.1.7, and there is nothing else esoteric we use. We do have Linux blades, but all of our Teslas are hosted on Windows. The entire estate is quite large, so we would need to ensure sFlow scales - no reason to think it won't, but I have little experience with it. 
Regards Nigel From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] Sent: 10 July 2012 16:19 To: Nigel LEACH Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Perhaps other developers could chime in but I'm not sure if the latest version could be compiled under Windows, at least I was not aware of any testing done. Going forward I would like to encourage users to use hsflowd under Windows. I'm talking to the developers to see if we can add support for GPU monitoring. Do you have any other requirements besides that? Thanks, Bernard On Tuesday, July 10, 2012, Nigel LEACH wrote: Hi Neil, Many thanks for the swift reply. I want to take a look at sFlow, but it isn't a prerequisite. Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. I say appeared successful, as make was error free, and a gmond.exe
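For reference, the mute sFlow-aggregator gmond mentioned earlier in this thread is configured along these lines in gmond.conf. This is a sketch of the Ganglia 3.2 sFlow support from memory - treat the exact option names as assumptions and check the linked blog post for the authoritative settings:

```
/* gmond.conf sketch: gmond as a mute sFlow aggregator (Ganglia >= 3.2) */
globals {
  mute = yes        /* collect and serve metrics, but emit none of its own */
}
sflow {
  udp_port = 6343   /* listen for sFlow datagrams from hsflowd agents */
}
```

Each hsflowd agent then points its unicast collector address at this gmond instead of the multicast channel.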
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload the gmond and 3rd party sources + compilation script somewhere. Also, I have gmetad 3.0.7 compiled for Windows. In addition, I developed (just for fun) my own implementation of gmetad 3.1.2 using .NET and C#. P.S. I do not know whether it is possible to use these gmond versions to collect statistics from GPUs. -- Best regards, Ivan. 2012/7/12 Nigel LEACH nigel.le...@uk.bnpparibas.com: Thanks for the updates Peter and Bernard. I have been unable to get gmond 3.4 working under Cygwin; my latest errors are parsing gm_protocol_xdr.c. I don't know whether we should follow this up - it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics. I take your point about re-using the existing GPU module and gmetric; unfortunately I don't have experience with Python. My plan is to write something in C to export the NVML metrics, with various output options. We will then decide whether to call this new code from the existing gmond 3.1 via gmetric, the new (if we get it working) gmond 3.4, or one of our existing third party tools - ITRS Geneos. As regards your list of metrics, they are pretty definitive, but I will probably also export: *total ecc errors - nvmlDeviceGetTotalEccErrors *individual ecc errors - nvmlDeviceGetDetailedEccErrors *active compute processes - nvmlDeviceGetComputeRunningProcesses Regards Nigel -Original Message- From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] Sent: 10 July 2012 20:06 To: Nigel LEACH Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics.
You could combine code from the python GPU module and gmetric implementations to produce a self-contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. [...]
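Peter's "self-contained script" suggestion can be sketched roughly as follows: read a few values through the nvidia-ml-py bindings and hand each one to the gmetric CLI. This is only an illustrative sketch, not the Ganglia module itself - the per-GPU metric names with an index suffix are my assumption, and the full metric set, units, and error handling are left out:

```python
# Sketch: export a few NVML readings via the gmetric command-line tool.
# Requires the nvidia-ml-py package and an NVIDIA driver on the node.
import subprocess

def gmetric_cmd(name, value, vtype, units):
    # Build the argument list for one gmetric invocation.
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", vtype, "--units", units]

def collect_and_send():
    import pynvml  # from the nvidia-ml-py package
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            for name, value, vtype, units in (
                ("gpu%d_temp" % i, temp, "uint32", "C"),
                ("gpu%d_util" % i, util.gpu, "uint32", "%"),
                ("gpu%d_mem_util" % i, util.memory, "uint32", "%"),
            ):
                subprocess.check_call(gmetric_cmd(name, value, vtype, units))
    finally:
        pynvml.nvmlShutdown()
```

Called from cron (or any scheduler) on each GPU node, collect_and_send() would push the readings into whichever gmond the local gmetric is configured to talk to.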
Re: [Ganglia-general] Gmond Compilation on Cygwin
Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR), the problem is with the 3.4 spin. -Original Message- From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] Sent: 12 July 2012 11:54 To: Nigel LEACH Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin [...]
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Nigel: Technically you only need 3.1 gmond to have support for the Python metric module. But I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we setup a meeting to discuss this? Thanks, Bernard On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR), the problem is with the 3.4 spin. [...]
[Ganglia-general] Sample/example gmetad.conf, gmond.conf, conf.php, etc for multiple grids, one web server, nfs mounted rrds area?
Hi Umberto, I think you may have misunderstood what I wrote in one of my previous postings. Unfortunately, I've only been able to configure a single Grid with multiple clusters. From what I have read, a single gmetad defines a Grid, which is a collection of clusters. What I am looking for is a Grid of Grids: that is, a single Grid which presents multiple remote Grids on a single web page. This would mean multiple gmetad daemons running on a single node, one for each Grid one wants to define. This could mean one web server for each gmetad, but that's not what I want. Examples of a Grid of Grids are listed below. http://ganglia.g.gsic.titech.ac.jp/ganglia/ - This is a Grid in Japan. You can see the top Grid shows 4 sources. If you look below, you will see four Grids, and when you access each Grid you will see clusters of nodes. In the case of this example, the clusters of nodes are grouped by racks of computers. This is what I want to try and set up. http://monitor.millennium.berkeley.edu/ - This is another example. Note that the Infrastructure Grid is within the UC Berkeley Grid. Someone did reply to one of my previous postings saying he had figured out how to do this and would post how it is done at a later date. What I'm attempting to do is to have a Grid of Grids where all of the information is accessible from a single web server and a common RRD file directory location. I've been trying to figure out how this is done, but haven't gotten it to work right at this time. I may decide just to have a single Grid with multiple clusters, since this is a little bit easier to manage. If I can figure this out, I will post it on the mailing list so that all can benefit from it. As I stated above, I've only been able to successfully set up the following configuration under Ganglia 3.1.7.
Ganglia 3.1.7 Apache web server
- OS Version: RedHat 5.5
- RRD files all stored on an NFS filesystem, /nfs/data/ganglia/rrds.
- A single Apache web server running a single gmetad daemon which collects data from 4 different clusters.
- Installed the NVIDIA GPU Python Ganglia module plugin. This requires the NVIDIA NVML Python binding nvidia-ml-py. If you search the mailing list, you can find more information about this if you are using NVIDIA GPUs. The binding requires Python 2.5 or higher.
- I had to use Python 2.7.2 since our RedHat 5.5 systems don't have a version of Python 2.5 or higher. I installed Python 2.7.2 on a common NFS-mounted filesystem and built Ganglia using this version of Python.

Ganglia 3.1.7 clients
- OS Version: RedHat 5.5
- All clients use a basic gmond.conf configuration using multicast. You can find examples of this if you search the mailing list, or you can take a look at http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start . This is a good link to start with. I did use unicast for one set of cluster nodes we have because, for some strange reason, I could only get one node to show up as being up on the web page. All other nodes would show up as being down after starting all of the gmond daemons on the cluster nodes. I'm not sure if the problem has something to do with the hardware or what. The biggest difference I can see is that this set of nodes uses 10 GigE network cards. Anyway, after switching to unicast for these nodes, the nodes all show as being up.
- None of the clients are running gmetad.

I hope what I have provided helps, and I do apologize if my postings have been confusing. Since I've gotten Ganglia 3.1.7 working, I want to try and get Ganglia 3.2 to work. I don't have this working completely yet.
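A unicast layout like the one described above - every node sends to one collector, and gmetad polls only the collector - might look like this in gmond.conf. The collector hostname is hypothetical; check the channel syntax against your stock gmond.conf:

```
/* On every cluster node: send metrics to a single collector host. */
udp_send_channel {
  host = collector.example.com   /* hypothetical collector node */
  port = 8649
}
/* On the collector node only: receive metrics and serve the
   aggregated cluster state to gmetad. */
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
```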
Kind Regards, Wayne Lee From: Umberto Toscano (Gmail) [mailto:wavefor...@gmail.com] Sent: Wednesday, October 05, 2011 5:53 AM To: Lee, Wayne Subject: [Ganglia-general] Sample/example gmetad.conf, gmond.conf, conf.php, etc for multiple grids, one web server, nfs mounted rrds area? Hi Lee, I have a multiple-cluster system. On the master node of each cluster, gmond and gmetad collect data via multicast from the other gmond daemons running on the cluster's nodes. I would like to create a grid on my web server that collects data from the gmetad running on each of the four cluster master nodes. I've read that you successfully configured Ganglia with one Grid of clusters; can you show me the gmond configuration for a single node, the gmetad configuration for a cluster master node, and the gmetad configuration for the aggregating web server? Thank you Regards --- Umberto Toscano
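The hierarchical setup Umberto asks about can be sketched in the top-level gmetad.conf. This is only an illustrative fragment with hypothetical hostnames, under the assumption that each cluster master's own gmetad exposes its XML state on the default xml_port 8651, which a parent gmetad can use as a data_source:

```
# Top-level gmetad.conf sketch: one "grid of grids" web server polling
# the gmetad on each cluster master node. Hostnames are hypothetical.
gridname "TopLevelGrid"
data_source "Cluster-A" cluster-a-master.example.com:8651
data_source "Cluster-B" cluster-b-master.example.com:8651
data_source "Cluster-C" cluster-c-master.example.com:8651
data_source "Cluster-D" cluster-d-master.example.com:8651
rrd_rootdir "/nfs/data/ganglia/rrds"
```

Each remote grid then appears nested under the top-level grid in the web frontend, while the per-cluster gmetads keep their own configurations unchanged.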
Re: [Ganglia-general] Gmond Compilation on Cygwin
Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. You could combine code from the python GPU module and gmetric implementations to produce a self-contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. [...] Peter On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: Hello Bernard, I was coming to that conclusion. I've been trying to compile on various combinations of Cygwin, Windows, and hardware this afternoon, but without success yet. [...] Regards Nigel From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] Sent: 10 July 2012 16:19 To: Nigel LEACH Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: [...] Thanks, Bernard On Tuesday, July 10, 2012, Nigel LEACH wrote: Hi Neil, Many thanks for the swift reply. I want to take a look at sFlow, but it isn't a prerequisite. Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. I say "appeared successful", as make was error free, and a gmond.exe was created. However, it doesn't appear to work out of the box. I created a default gmond.conf ./gmond --default_config /usr/local/etc/gmond.conf and then simply ran gmond.
It started a process, but no port (8649) was created. Running in debug mode I get this:

$ ./gmond -d 10
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module

and nothing further. I have done little investigation yet, so unless there is anything obvious I am missing, I'll continue to troubleshoot. Regards Nigel From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] Sent: 09 July 2012 18:15 To: Nigel LEACH Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin You could try adding --disable-sflow as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). Neil On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote: Ganglia 3.4.0 Windows 2008 R2 Enterprise Cygwin 1.5.25 IBM iDataPlex dx360 with Tesla M2070 Confuse 2.7 I'm trying to use the Ganglia
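The "process runs but port 8649 never opens" symptom above is easy to smoke-test: a working gmond accepts a plain TCP connection on its tcp_accept_channel and streams its XML state. A small hypothetical Python helper for that check, assuming the default port:

```python
# Smoke test: is anything accepting TCP connections on gmond's
# tcp_accept_channel (default 8649)?
import socket

def gmond_responding(host="127.0.0.1", port=8649, timeout=2.0):
    """Return True if a TCP connect to host:port succeeds."""
    try:
        s = socket.create_connection((host, port), timeout)
        s.close()
        return True
    except (socket.error, OSError):
        return False
```

Running gmond_responding() on the gmond host distinguishes "daemon alive but channel never opened" (False) from a frontend or gmetad polling problem (True).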
Re: [Ganglia-general] Gmond Compilation on Cygwin
Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in. Thanks, Bernard On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal peter.ph...@gmail.com wrote: Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. [...]
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hey Nigel, I would be happy to help where I can. I think Peter's approach is a good start. We are updating the Ganglia plug-in with a few more metrics. My dev branch on github has some updates not yet in the trunk. https://github.com/ralexander/gmond_python_modules/tree/master/gpu/nvidia In terms of metrics, I can help explain what each means. I expect the usefulness of each to vary based on installation, so hopefully others can contribute their thoughts.
* gpu_num - Useful indirectly.
* gpu_driver - Useful when different machines may have different installed driver versions.
* gpu_type - Marketing name of the GPU.
* gpu_uuid - Globally unique, immutable ID for the GPU chip. This is the NVIDIA-preferred identifier when software interfaces with a GPU. On a multi-GPU board, each GPU has a unique UUID.
* gpu_pci_id - Where the GPU sits on the PCI bus.
+ gpu_serial - For Tesla GPUs there is a serial number printed on the board. Note that when there are multiple GPU chips on a single board, they share a common board serial number. When a human needs to grab a particular board, this number works well.
* gpu_mem_total, * gpu_mem_used - Useful for high-level application profiling.
* gpu_graphics_speed + gpu_max_graphics_speed, * gpu_sm_speed + gpu_max_sm_speed, * gpu_mem_speed + gpu_max_mem_speed - These are various clock speeds. Faster clocks mean higher performance.
* gpu_perf_state - Similar to CPU P-states. P0 is the fastest performance state. When the pstate is not P0, clock speeds and PCIe bandwidth can be reduced.
* gpu_util, * gpu_mem_util - The % of time the GPU SM or GPU memory was busy over the last second. This is a very coarse-grained way to monitor GPU usage, e.g. if only one SM is busy, but it is busy for the entire second, then gpu_util = 100.
* gpu_fan, * gpu_temp - Some GPUs support these. Useful to see how well the GPU is cooled.
* gpu_power_usage + gpu_power_man_mode + gpu_power_man_limit - GPU power draw. Some GPUs support configurable power limits via power management mode.
* gpu_ecc_mode Useful to ensure all GPUs are configured the same. Describes if GPU memory error checking and correction is on or off. If you are only concerned about coarse grained GPU performance, then GPU performance state, utilization and %memory used may work well. Bernard, thanks for the heads up. Hope that helps, Robert Alexander NVIDIA CUDA Tools Software Engineer -Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Tuesday, July 10, 2012 12:32 PM To: Peter Phaal Cc: Nigel LEACH; ganglia-general@lists.sourceforge.net; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in. Thanks, Bernard On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal peter.ph...@gmail.com wrote: Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to send add the GPU metrics. You could combine code from the python GPU module and gmetric implementations to produce a self contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, then adding them to the sFlow agent is pretty trivial. Currently the Ganglia python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics. 
* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic, since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup this involves restructuring your configuration. You still need to have at least one gmond instance, but it acts as an sFlow aggregator and is mute: http
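The mute sFlow-aggregator gmond described above might look like the following gmond.conf sketch, assuming Ganglia 3.2 or later with sFlow support; the cluster name and ports are placeholders:

```
# /etc/ganglia/gmond.conf on the sFlow aggregator node (sketch).
# The daemon collects sFlow but announces no metrics of its own.
globals {
  daemonize = yes
  mute = yes   # aggregate only; do not send this host's own metrics
  deaf = no    # still accept and serve metric data
}

cluster {
  name = "gpu-cluster"   # placeholder
}

# Receive sFlow datagrams from Host sFlow agents (standard sFlow port).
sflow {
  udp_port = 6343
}

# Serve the aggregated XML to gmetad.
tcp_accept_channel {
  port = 8649
}
```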
Re: [Ganglia-general] Ganglia-general Digest, Vol 61, Issue 14
:6F:14:20:09 inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:185 Memory:ec00-ec012800

I tried creating a static route with "route add -host 239.2.11.71 dev eth0", but I still get the same error. Any hints on further troubleshooting I can do to track down this problem, or further information I can send out?

Mark

------------------------------

Message: 8
Date: Thu, 16 Jun 2011 23:21:52 -0700
From: Bernard Li bern...@vanhpc.org
Subject: [Ganglia-general] Gmond Python module for monitoring NVIDIA GPUs
To: Ganglia ganglia-general@lists.sourceforge.net
Message-ID: banlktim6+mbedo0x-jok5edxs67mc8u...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

Dear all:

Just a quick note letting you guys know that we now have a python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

If you are running a cluster with NVIDIA GPUs, please download the module and give it a try. The module itself is pretty much feature complete, but the GUI/reports still need some work. It would be cool if we could extend it to work with the new gweb 2.0 as well. Please feel free to fork the repo and submit pull requests.

Special thanks to the team at NVIDIA for their help in implementing the plugin and Jeremy Enos at NCSA for providing access to a NVIDIA GPU cluster.

Cheers,

Bernard
End of Ganglia-general Digest, Vol 61, Issue 14

--
Yours sincerely,
Huaxing Guo
Email address: ghxand...@gmail.com
High Performance and Grids Computing Center
Sun Yat-Sen University
Canton 51, China