Re: [Ganglia-general] Ganglia 3.2.0 questions.
Hi All, I am facing a problem opening the Ganglia web page. I have installed ganglia-3.2.0 and did all the steps; I followed the same steps before and it worked fine, but now I get the error below while loading http://ip-address/ganglia

The requested URL /pages/Main_Page was not found on this server. Apache/2.2.3

Please help me in solving this issue. -- Thanks Regards, Padma Pavani

-Original Message- From: Lee, Wayne [mailto:w...@hess.com] Sent: Friday, October 14, 2011 9:55 PM To: Vladimir Vuksan Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Ganglia 3.2.0 questions.

Hi Vladimir, Thanks for your reply. So if I want to have the kind of graphs in Ganglia 3.2 in my 3.1.7 version, I just need to untar the files in gweb-2.1.7 into my Apache root Document directory, /var/www/html/ganglia. Is that right? Also, I see that there is something called dwoo in the gweb-2.1.7 tar file? Not sure what dwoo is. Do I need to download an RPM or other application onto the web server which is hosting the Ganglia web pages? Do you know much about the NVIDIA GPU module that was included with the set of Python modules that came with the 3.1.7 Ganglia tar file? I will post a separate post about it shortly. I've gotten Ganglia 3.1.7 running and it looks great. If the gweb-2.1.7 items can be installed to provide me the newer style graphs, that would be great, and even better if I can get it to work with displaying 8 NVIDIA GPUs on our systems. Kind Regards, Wayne Lee

-Original Message- From: Vladimir Vuksan [mailto:vli...@veus.hr] Sent: Friday, October 07, 2011 8:14 AM To: Lee, Wayne Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Ganglia 3.2.0 questions.

Major changes between 3.1.7 and 3.2.0 are:
- sFlow support for hosts
- Ability to override the hostname of the node
- gmetric support for adding metric groups
If you don't need those features 3.1.7 will do just fine.
You can download the new web UI separately; it's completely independent of the Ganglia version.

On Tue, 4 Oct 2011, Lee, Wayne wrote:

I've been running Ganglia 3.1.7 on a set of test systems for a few weeks and I was wondering if I should go to version 3.2.0. I did install it and have run it across the same set of test nodes I used for version 3.1.7. A few comments/questions.

1. I like the new graphs and layout of 3.2.0.
2. Is there any documentation specific to 3.2.0 with regards to what is required and what needs to be configured to make it work? I did use my client and server configuration files with hardly any changes and I did get most of the web pages and nodes to display, but when I drill down to look at the "host view" for individual nodes, I do not see anything. No individual graphs or stats. I do notice that in my /var/lib/ganglia directory there is now a "dwoo" subdirectory along with the rrds directory. The "dwoo" directory contains some *.php scripts. Are these responsible for generating the "host view" information for the individual nodes? I have no idea what "dwoo" is.
3. I attempted to get the NVIDIA GPU Python module to work, but I do not get any stats for the GPUs on my compute nodes. Has anyone had any success in getting the NVIDIA module to work with 3.2.0?
4. Also, I seem to recall that when I installed 3.1.7, the contents of /var/www/html/ganglia were created during the "make install" of 3.1.7. I didn't see this happen with 3.2.0. I just manually copied the files in the 3.2.0 tarball's "web" directory into /var/www/html/ganglia. Is this the way the install should have been done?

This e-mail and any attachments are for the sole use of the intended recipient(s) and may contain information that is confidential. If you are not the intended recipient(s) and have received this e-mail in error, please immediately notify the sender by return e-mail and delete this e-mail from your computer.
Any distribution, disclosure or the taking of any other action by anyone other than the intended recipient(s) is strictly prohibited.

___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Neil, Many thanks for the swift reply. I want to take a look at sFlow, but it isn't a prerequisite.

Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. I say "appeared successful", as make was error free, and a gmond.exe was created. However, it doesn't appear to work out of the box. I created a default gmond.conf

./gmond --default_config /usr/local/etc/gmond.conf

and then simply ran gmond. It started a process, but no port (8649) was opened. Running in debug mode I get this

$ ./gmond -d 10
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module

and nothing further. I have done little investigation yet, so unless there is anything obvious I am missing, I'll continue to troubleshoot. Regards Nigel

From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] Sent: 09 July 2012 18:15 To: Nigel LEACH Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

You could try adding --disable-sflow as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). Neil

On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote:

Ganglia 3.4.0, Windows 2008 R2 Enterprise, Cygwin 1.5.25, IBM iDataPlex dx360 with Tesla M2070, Confuse 2.7

I'm trying to use the Ganglia Python modules to monitor a Windows based GPU cluster, but am having problems getting gmond to compile. This 'configure' completes successfully

./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build

but 'make' fails; this is the tail of standard output:

mv -f .deps/g25_config.Tpo .deps/g25_config.Po
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c
mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c
sflow.c: In function `process_struct_JVM':
sflow.c:1033: warning: comparison is always true due to limited range of data type
sflow.c:1034: warning: comparison is always true due to limited range of data type
sflow.c:1035: warning: comparison is always true due to limited range of data type
sflow.c:1036: warning: comparison is always true due to limited range of data type
sflow.c:1037: warning: comparison is always true due to limited range of data type
sflow.c:1038: warning: comparison is always true due to limited range of data type
sflow.c:1039: warning: comparison is always true due to limited range of data type
sflow.c: In function `processCounterSample':
sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4)
sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4)
sflow.c: In function `process_sflow_datagram':
sflow.c:1348: error: `AF_INET6' undeclared (first use in this function)
sflow.c:1348: error: (Each undeclared identifier is reported only once
sflow.c:1348: error: for each function it appears in.)
make[3]: *** [sflow.o] Error 1
make[3]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/tmp/ganglia-3.4.0'
make: *** [all] Error 2

Has anyone come across this before?
Many Thanks, Nigel

___ This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and delete this e-mail. Any unauthorised copying, disclosure or distribution of the material in this e-mail is prohibited. Please refer to http://www.bnpparibas.co.uk/en/email-disclaimer/ for additional disclosures.
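[Editor's sketch] Nigel's "no port (8649) was created" check can be scripted. The helper below is an illustrative assumption, not part of Ganglia; 8649 is gmond's default tcp_accept_channel port, and a healthy gmond answers a connection there with its XML metric dump:

```python
import socket

def gmond_tcp_port_open(host="127.0.0.1", port=8649, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused / timed out: nothing is listening.
        return False

if __name__ == "__main__":
    print("gmond listening:", gmond_tcp_port_open())
```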
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Nigel: Perhaps other developers could chime in but I'm not sure if the latest version could be compiled under Windows, at least I was not aware of any testing done. Going forward I would like to encourage users to use hsflowd under Windows. I'm talking to the developers to see if we can add support for GPU monitoring. Do you have any other requirements besides that? Thanks, Bernard

On Tuesday, July 10, 2012, Nigel LEACH wrote: [...]
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hello Bernard, I was coming to that conclusion. I've been trying to compile on various combinations of Cygwin, Windows and hardware this afternoon, but without success yet. I've still got a few more tests to do though. The GPU plugin is my only reason for upgrading from our current 3.1.7, and there is nothing else esoteric we use. We do have Linux blades, but all of our Teslas are hosted on Windows. The entire estate is quite large, so we would need to ensure sFlow scales; no reason to think it won't, but I have little experience with it. Regards Nigel

From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] Sent: 10 July 2012 16:19 To: Nigel LEACH Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin [...]
Re: [Ganglia-general] Gmond Compilation on Cygwin
Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to add the GPU metrics. You could combine code from the python GPU module and gmetric implementations to produce a self-contained script for exporting GPU metrics:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib

Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing a standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, adding them to the sFlow agent is pretty trivial. Currently the Ganglia python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics.

* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic, since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup this involves restructuring your configuration.
You still need to have at least one gmond instance, but it acts as an sFlow aggregator and is mute: http://blog.sflow.com/2011/07/ganglia-32-released.html Peter

On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote: [...]
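[Editor's sketch] Peter's suggestion (read GPU stats via the python module, publish via the gmetric command line instead of a gmond agent) might be combined roughly as below. The helper names are illustrative assumptions; gmetric's --name/--value/--type/--units options are its standard flags:

```python
import subprocess

def gmetric_cmd(name, value, type_="uint32", units="", gmetric="gmetric"):
    """Build a gmetric command line that publishes one metric sample."""
    return [gmetric,
            "--name", name,
            "--value", str(value),
            "--type", type_,
            "--units", units]

def publish_gpu_util(percent):
    # Hypothetical wrapper: in a real script the value would come from
    # the python GPU module's NVML bindings rather than a constant.
    subprocess.check_call(gmetric_cmd("gpu_util", percent, units="%"))
```

For example, `gmetric_cmd("gpu_util", 87, units="%")` produces `['gmetric', '--name', 'gpu_util', '--value', '87', '--type', 'uint32', '--units', '%']`, which a cron job or loop could invoke per metric.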
Re: [Ganglia-general] Gmond Compilation on Cygwin
Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in. Thanks, Bernard

On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal peter.ph...@gmail.com wrote: [...]
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hey Nigel, I would be happy to help where I can. I think Peter's approach is a good start. We are updating the Ganglia plug-in with a few more metrics. My dev branch on github has some updates not yet in the trunk. https://github.com/ralexander/gmond_python_modules/tree/master/gpu/nvidia In terms of metrics, I can help explain what each means. I expect the usefulness of each to vary based on installation, so hopefully others can contribute their thoughts. * gpu_num - Useful indirectly. * gpu_driver - Useful when different machines may have different installed driver versions. * gpu_type - Marketing name of the GPU. * gpu_uuid - Globally unique immutable ID for the GPU chip. This is the NVIDIA preferred identifier when SW interfaces with a GPU. On a multi GPU board, each GPU has a unique UUID. * gpu_pci_id - What the GPU looks like on the PCI bus ID. + gpu_serial - For Tesla GPUs there is a serial number printed on the board. Note, that when there are multiple GPU chips on a single board, they share a common board serial number. When a human needs to grab a particular board, this number works well. * gpu_mem_total * gpu_mem_used Useful for high level application profiling. * gpu_graphics_speed + gpu_max_graphics_speed * gpu_sm_speed + gpu_max_sm_speed * gpu_mem_speed + gpu_max_mem_speed These are various clock speeds. Faster clocks - higher performance. * gpu_perf_state Similar to CPU pstates. P0 is the fastest performance. When pstate is P0 clock speeds and PCIe bandwidth can be reduced. * gpu_util * gpu_mem_util % of time when the GPU SM or GPU memory was busy over the last second This is a very coarse grain way to monitor GPU usage. I.E. If only one SM is busy, but it is busy for the entire second then gpu_util = 100 * gpu_fan * gpu_temp Some GPUs support these. Useful to see how well the GPU is cooled. * gpu_power_usage + gpu_power_man_mode + gpu_power_man_limit GPU power draw. Some GPUs support configurable power limits via power management mode. 
* gpu_ecc_mode Useful to ensure all GPUs are configured the same. Describes if GPU memory error checking and correction is on or off. If you are only concerned about coarse grained GPU performance, then GPU performance state, utilization and %memory used may work well. Bernard, thanks for the heads up. Hope that helps, Robert Alexander NVIDIA CUDA Tools Software Engineer -Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Tuesday, July 10, 2012 12:32 PM To: Peter Phaal Cc: Nigel LEACH; ganglia-general@lists.sourceforge.net; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in. Thanks, Bernard On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal peter.ph...@gmail.com wrote: Nigel, A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to send add the GPU metrics. You could combine code from the python GPU module and gmetric implementations to produce a self contained script for exporting GPU metrics: https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia https://github.com/ganglia/ganglia_contrib Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics. What would take some thought is developing standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, then adding them to the sFlow agent is pretty trivial. Currently the Ganglia python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics. 
* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic, since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent).

The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup this involves re-structuring your configuration. You still need at least one gmond instance, but it acts as an sFlow aggregator and is mute:
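A minimal gmond.conf sketch for such a mute sFlow aggregator might look like the following. This assumes Ganglia 3.2's sflow support; directive names and defaults should be checked against the gmond.conf shipped with your version.

```
globals {
  mute = yes   /* do not report this gmond's own metrics */
  deaf = no    /* still listen and aggregate */
}

/* Accept sFlow datagrams from the Host sFlow agents */
sflow {
  udp_port = 6343
}

/* gmetad polls this gmond over the usual TCP channel */
tcp_accept_channel {
  port = 8649
}
```

The Host sFlow agents on each node are then pointed (unicast) at this collector rather than at a multicast group.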
[Ganglia-general] Fwd: Re: Can't import the metric module [nvidia]
From: fabian cruz fabo...@yahoo.com.mx
Subject: Re: [Ganglia-general] Can't import the metric module [nvidia]
To: Mohd Mozammil khan moz_r...@yahoo.com
Date: Tuesday, July 10, 2012, 18:16

Thanks Mozammil,

I recompiled gmond using --with-python and now it's working without problems.

PS: in order to recompile gmond with the --with-python parameter, I needed to recompile Python with the --with-pydebug --enable-shared parameters.

Fabian

--- On Sat, 7-Jul-12, Mohd Mozammil khan moz_r...@yahoo.com wrote:

From: Mohd Mozammil khan moz_r...@yahoo.com
Subject: Re: [Ganglia-general] Can't import the metric module [nvidia]
To: fabian cruz fabo...@yahoo.com.mx, ganglia-general@lists.sourceforge.net
Date: Saturday, July 7, 2012, 8:36

You may need to recompile gmond with the --with-python=<path to python bin> argument. I would also suggest you go through the Installation Instructions at https://github.com/ganglia/gmond_python_modules/blob/master/gpu/nvidia/README for the NVIDIA GPU monitoring plugin for gmond. Don't forget to install the Python bindings for the NVIDIA Management Library.

Thanks,
Mozammil

From: fabian cruz fabo...@yahoo.com.mx
To: ganglia-general@lists.sourceforge.net
Sent: Friday, 6 July 2012 11:27 PM
Subject: [Ganglia-general] Can't import the metric module [nvidia]

Hi,

I am trying to add GPU metrics to Ganglia using the Python module provided by NVIDIA at https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

When I try to run gmond on the client node I get the following error message:

[PYTHON] Can't import the metric module [nvidia].
Traceback (most recent call last):
  File /usr/lib64/ganglia/python_modules/nvidia.py, line 27, in ?
    from pynvml import *
  File /usr/lib64/ganglia/python_modules/pynvml.py, line 32, in ?
ImportError: No module named ctypes

I am using Python version 2.4:

[root@vsa038033 Python-2.7.3]# rpm -q python
python-2.4.3-27.cgvel5
[root@vsa038033 Python-2.7.3]#

According to the NVIDIA Python module documentation, I should use a Python version newer than 2.4 (ctypes is not in the 2.4 standard library), so I installed Python 2.7 at /usr/local/python2.7, but I don't know how to tell gmond to use this new version for the nvidia module. Any idea?

Thanks

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats.
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
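The fix Fabian describes in his reply (rebuilding Python as a shared library, then rebuilding gmond against it) can be sketched as the following shell sequence. Paths and versions are illustrative, and --with-pydebug is optional; check ./configure --help in each tree for the exact option spellings.

```shell
# Build a shared-library Python that gmond's embedded interpreter can link:
cd Python-2.7.3
./configure --prefix=/usr/local/python2.7 --enable-shared
make && make install

# Rebuild gmond, pointing --with-python at the new interpreter:
cd ../ganglia-3.2.0
./configure --with-python=/usr/local/python2.7/bin/python
make && make install

# The new libpython must be on the loader path when gmond starts:
export LD_LIBRARY_PATH=/usr/local/python2.7/lib:$LD_LIBRARY_PATH
```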