Nigel,

A simple option would be to use Host sFlow agents to export the core
metrics from your Windows servers and gmetric to send the GPU
metrics.

You could combine code from the Python GPU module and the gmetric
implementations to produce a self-contained script for exporting GPU
metrics:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib
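
To sketch what that combination might look like (a rough outline, not
tested against real hardware; the pynvml calls are the ones the gmond
GPU module uses and the gmetric flags are standard, but the metric
names, types, and the `gmetric` binary being on the PATH are
assumptions here):

```python
#!/usr/bin/env python
# Sketch: poll GPU metrics with pynvml and push each one via gmetric.
# Assumes the pynvml bindings and the gmetric binary are installed.
import subprocess

def gmetric_args(name, value, mtype="uint32", units="", gmetric="gmetric"):
    """Build the gmetric command line for one metric sample."""
    return [gmetric, "--name", name, "--value", str(value),
            "--type", mtype, "--units", units]

def send_metric(name, value, mtype="uint32", units=""):
    """Invoke gmetric to publish a single metric."""
    subprocess.check_call(gmetric_args(name, value, mtype, units))

def export_gpu_metrics():
    # Main loop -- requires NVML and actual GPU hardware to run.
    from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                        nvmlDeviceGetHandleByIndex, nvmlDeviceGetTemperature,
                        nvmlDeviceGetUtilizationRates, NVML_TEMPERATURE_GPU)
    nvmlInit()
    try:
        for i in range(nvmlDeviceGetCount()):
            h = nvmlDeviceGetHandleByIndex(i)
            send_metric("gpu%d_temp" % i,
                        nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU),
                        units="C")
            util = nvmlDeviceGetUtilizationRates(h)
            send_metric("gpu%d_util" % i, util.gpu, units="%")
    finally:
        nvmlShutdown()
```

Run from cron or a loop on each GPU node; the metrics then appear in
Ganglia alongside whatever the Host sFlow agent reports.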

Longer term, it would make sense to extend Host sFlow to use the
C-based NVML API to extract and export metrics. This would be
straightforward - the Host sFlow agent uses native C APIs on the
platforms it supports to extract metrics.

What would take some thought is developing a standard set of summary
metrics to characterize GPU performance. Once that set is agreed on,
adding the metrics to the sFlow agent is fairly trivial.

Currently the Ganglia python module exports the following metrics -
are they the right set? Anything missing? It would be great to get
involvement from the broader Ganglia community to capture best
practice from anyone running large GPU clusters, as well as getting
input from NVIDIA about the key metrics.

* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to
sFlow as the measurement transport reduces network traffic, since all
the metrics for a node are carried in a single UDP datagram (rather
than a datagram per metric when using gmond as the agent). The other
consideration is that sFlow is unicast, so if you are using a
multicast Ganglia setup, this involves restructuring your
configuration.
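
To make the datagram point concrete, here is a toy illustration of the
two transports over loopback (plain key=value strings, not the actual
sFlow XDR wire encoding):

```python
# Toy illustration of the transport difference -- NOT real sFlow encoding.
# gmond-style: one UDP datagram per metric; sFlow-style: one per node.
import socket

metrics = {"gpu_temp": 65, "gpu_util": 80, "gpu_mem_used": 2048}

def send_per_metric(sock, addr, metrics):
    """One datagram per metric (gmond-style). Returns datagram count."""
    for name, value in metrics.items():
        sock.sendto(("%s=%s" % (name, value)).encode(), addr)
    return len(metrics)

def send_bundled(sock, addr, metrics):
    """All of a node's metrics in a single datagram (sFlow-style)."""
    payload = " ".join("%s=%s" % (n, v) for n, v in metrics.items())
    sock.sendto(payload.encode(), addr)
    return 1
```

With hundreds of metrics per node, the bundled approach cuts the
packet count (and per-packet overhead) by the same factor.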

You still need to have at least one gmond instance, but it acts as an
sFlow aggregator and is mute:
http://blog.sflow.com/2011/07/ganglia-32-released.html
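
The gmond.conf for that mute aggregator would look something like the
following (a sketch only; check the sflow section name and port
against the 3.2 docs and the blog post above):

```
globals {
  mute = yes        /* aggregate only; don't report this host's own metrics */
}
sflow {
  udp_port = 6343   /* listen for sFlow datagrams here (illustrative port) */
}
```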

Peter

On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH
<[email protected]> wrote:
> Hello Bernard, I was coming to that conclusion, I’ve been trying to compile
> on various combinations of Cygwin, Windows, Hardware this afternoon, but
> without success yet. I’ve still got a few more tests to do though.
>
>
>
> The GPU plugin is my only reason for upgrading from our current 3.1.7, and
> there is nothing else esoteric we use. We do have Linux blades, but all of
> our Teslas are hosted on Windows. The entire estate is quite large, so we
> would need to ensure sFlow scales; no reason to think it won’t, but I have
> little experience with it.
>
>
>
> Regards
>
> Nigel
>
>
>
> From: [email protected] [mailto:[email protected]]
> Sent: 10 July 2012 16:19
> To: Nigel LEACH
> Cc: [email protected]; [email protected]
>
>
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
>
>
> Hi Nigel:
>
>
>
> Perhaps other developers could chime in but I'm not sure if the latest
> version could be compiled under Windows, at least I was not aware of any
> testing done.
>
>
>
> Going forward I would like to encourage users to use hsflowd under Windows.
> I'm talking to the developers to see if we can add support for GPU
> monitoring.  Do you have any other requirements besides that?
>
>
>
> Thanks,
>
>
>
> Bernard
>
> On Tuesday, July 10, 2012, Nigel LEACH wrote:
>
> Hi Neil, Many thanks for the swift reply.
>
>
>
> I want to take a look at sFlow, but it isn’t a prerequisite.
>
>
>
> Anyway, I disabled sFlow, and (separately) included the patch you sent. Both
> fixes appeared successful. For now I am going with your patch, and sFlow
> enabled.
>
>
>
> I say “appeared successful”, as make was error free, and a gmond.exe was
> created. However, it doesn’t appear to work out of the box. I created a
> default gmond.conf
>
>
>
> ./gmond --default_config > /usr/local/etc/gmond.conf
>
>
>
> and then simply ran gmond. It started a process, but no port (8649) was
> created. Running in debug mode I get this
>
>
>
> $ ./gmond -d 10
>
> loaded module: core_metrics
>
> loaded module: cpu_module
>
> loaded module: disk_module
>
> loaded module: load_module
>
> loaded module: mem_module
>
> loaded module: net_module
>
> loaded module: proc_module
>
> loaded module: sys_module
>
>
>
>
>
> and nothing further.
>
>
>
> I have done little investigation yet, so unless there is anything obvious I
> am missing, I’ll continue to troubleshoot.
>
>
>
> Regards
>
> Nigel
>
>
>
>
>
> From: [email protected] [mailto:[email protected]]
> Sent: 09 July 2012 18:15
> To: Nigel LEACH
> Cc: [email protected]
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
>
>
> You could try adding "--disable-sflow" as another configure option.   (Or
> were you planning to use sFlow agents such as hsflowd?).
>
>
>
> Neil
>
>
>
>
>
> On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote:
>
>
>
> Ganglia 3.4.0
>
> Windows 2008 R2 Enterprise
>
> Cygwin 1.5.25
>
> IBM iDataPlex dx360 with Tesla M2070
>
> Confuse 2.7
>
>
>
> I’m trying to use the Ganglia Python modules to monitor a Windows based GPU
> cluster, but having problems getting gmond to compile. This ‘configure’
> completes successfully
>
>
>
> ./configure --with-libconfuse=/usr/local --without-libpcre
> --enable-static-build
>
>
>
> but ‘make’ fails, this is the tail of standard output
>
>
>
> mv -f .deps/g25_config.Tpo .deps/g25_config.Po
>
> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c
>
> mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po
>
> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c
>
> sflow.c: In function `process_struct_JVM':
>
> sflow.c:1033: warning: comparison is always true due to limited range of data type
>
>
> ___________________________________________________________
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and delete this e-mail. Any
> unauthorised copying, disclosure or distribution of the material in this
> e-mail is prohibited.
>
> Please refer to http://www.bnpparibas.co.uk/en/email-disclaimer/ for
> additional disclosures.
>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>

