Re: [Ganglia-general] Gmond Compilation on Cygwin

2012-07-12 Thread Robert Alexander
Hey,

A meeting may be a good idea.  My schedule is mostly open next week.  When are 
others free?  I will brush up on sFlow by then.

NVML and the Python metric module are tested at NVIDIA on Windows and Linux, 
but not within Cygwin.  The process will be easier/faster on the NVML side if 
we keep Cygwin out of the loop.

-Robert

-Original Message-
From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Thursday, July 12, 2012 10:49 AM
To: Nigel LEACH
Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter 
Phaal; Robert Alexander
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Hi Nigel:

Technically you only need 3.1 gmond to have support for the Python metric 
module.  But I'm not sure whether we have ever tested this under Windows.
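
For reference, a minimal sketch of the interface such a module implements under
gmond's modpython (the descriptor fields follow the existing modules in
gmond_python_modules; the gpu_temp metric below is only a placeholder):

# Minimal gmond Python metric module sketch (gmond 3.1+ with modpython).
descriptors = []

def temp_handler(name):
    # Placeholder: replace with a real reading, e.g. an NVML query.
    return 0

def metric_init(params):
    global descriptors
    descriptors = [{
        'name': 'gpu_temp',
        'call_back': temp_handler,
        'time_max': 90,
        'value_type': 'uint',
        'units': 'C',
        'slope': 'both',
        'format': '%u',
        'description': 'GPU temperature (placeholder)',
        'groups': 'gpu',
    }]
    return descriptors

def metric_cleanup():
    pass

if __name__ == '__main__':
    # Stand-alone test outside gmond.
    for d in metric_init({}):
        print('%s = %s' % (d['name'], d['call_back'](d['name'])))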

Peter and Robert: How quickly can we get hsflowd to support GPU metrics 
collection internally?  Should we set up a meeting to discuss this?

Thanks,

Bernard

On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com 
wrote:
 Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using 
 APR); the problem is with the 3.4 spin.

 -Original Message-
 From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com]
 Sent: 12 July 2012 11:54
 To: Nigel LEACH
 Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net
 Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

 Hi all,

 Maybe this will be of interest. Some time ago I successfully compiled gmond 
 3.0.7 and 3.1.2 under Cygwin. If you need them, I can upload the gmond and 
 third-party sources plus the compilation script somewhere.
 Also, I have gmetad 3.0.7 compiled for Windows. In addition, I developed 
 (just for fun) my own implementation of gmetad 3.1.2 using .NET and C#.

 P.S. I do not know whether it is possible to use these gmond versions to 
 collect statistics from GPUs.

 --
 Best regards,
 Ivan.

 2012/7/12 Nigel LEACH nigel.le...@uk.bnpparibas.com:
 Thanks for the updates Peter and Bernard.

 I have been unable to get gmond 3.4 working under Cygwin; my latest errors 
 are in parsing gm_protocol_xdr.c. I don't know whether we should follow this 
 up. It would be nice to have a Windows gmond, but my only reason for 
 upgrading is the GPU metrics.

 I take your point about re-using the existing GPU module and gmetric; 
 unfortunately I don't have experience with Python. My plan is to write 
 something in C to export the NVML metrics, with various output options. We 
 will then decide whether to call this new code from the existing gmond 3.1 via 
 gmetric, from the new gmond 3.4 (if we get it working), or from one of our 
 existing third-party tools - ITRS Geneos.

 As regards your list of metrics, they are pretty definitive, but I will 
 probably also export:

 * total ECC errors - nvmlDeviceGetTotalEccErrors
 * individual ECC errors - nvmlDeviceGetDetailedEccErrors
 * active compute processes - nvmlDeviceGetComputeRunningProcesses
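
 For illustration, a rough sketch of these extra queries through the pynvml 
 bindings (function and constant names are taken from the nvidia-ml-py package 
 and should be verified against the installed NVML version):

 # Rough sketch of the additional NVML queries via the pynvml bindings.
 from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                     nvmlDeviceGetTotalEccErrors,
                     nvmlDeviceGetComputeRunningProcesses,
                     NVML_MEMORY_ERROR_TYPE_UNCORRECTED, NVML_VOLATILE_ECC)

 nvmlInit()
 try:
     handle = nvmlDeviceGetHandleByIndex(0)
     # Uncorrected ECC error count since the last driver load.
     ecc = nvmlDeviceGetTotalEccErrors(
         handle, NVML_MEMORY_ERROR_TYPE_UNCORRECTED, NVML_VOLATILE_ECC)
     # Processes currently holding a compute context on the GPU.
     procs = nvmlDeviceGetComputeRunningProcesses(handle)
     print('uncorrected volatile ECC errors: %d' % ecc)
     print('active compute processes: %d' % len(procs))
 finally:
     nvmlShutdown()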

 Regards
 Nigel

 -Original Message-
 From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com]
 Sent: 10 July 2012 20:06
 To: Nigel LEACH
 Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net
 Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

 Nigel,

 A simple option would be to use Host sFlow agents to export the core metrics 
 from your Windows servers and use gmetric to add the GPU metrics.

 You could combine code from the python GPU module and gmetric
 implementations to produce a self-contained script for exporting GPU
 metrics:

 https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
 https://github.com/ganglia/ganglia_contrib
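
 A rough sketch of what such a self-contained script could look like, assuming 
 the pynvml bindings and gmetric on the PATH (the metric names and flags here 
 are illustrative and should be checked against your gmetric build):

 #!/usr/bin/env python
 # Sketch: read a few GPU metrics via pynvml and push them with gmetric.
 import subprocess
 from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                     nvmlDeviceGetHandleByIndex, nvmlDeviceGetTemperature,
                     nvmlDeviceGetUtilizationRates, NVML_TEMPERATURE_GPU)

 def send(name, value, mtype, units):
     # Shell out to gmetric; one spoofed-free metric per call.
     subprocess.check_call(['gmetric', '--name', name, '--value', str(value),
                            '--type', mtype, '--units', units])

 nvmlInit()
 try:
     for i in range(nvmlDeviceGetCount()):
         handle = nvmlDeviceGetHandleByIndex(i)
         util = nvmlDeviceGetUtilizationRates(handle)
         temp = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
         send('gpu%d_util' % i, util.gpu, 'uint32', '%')
         send('gpu%d_temp' % i, temp, 'uint32', 'C')
 finally:
     nvmlShutdown()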

 Longer term, it would make sense to extend Host sFlow to use the C-based 
 NVML API to extract and export metrics. This would be straightforward - the 
 Host sFlow agent uses native C APIs on the platforms it supports to extract 
 metrics.

 What would take some thought is developing a standard set of summary metrics 
 to characterize GPU performance. Once the set of metrics is agreed on, then 
 adding them to the sFlow agent is pretty trivial.

 Currently the Ganglia python module exports the following metrics - are they 
 the right set? Anything missing? It would be great to get involvement from 
 the broader Ganglia community to capture best practice from anyone running 
 large GPU clusters, as well as getting input from NVIDIA about the key 
 metrics.

 * gpu_num
 * gpu_driver
 * gpu_type
 * gpu_uuid
 * gpu_pci_id
 * gpu_mem_total
 * gpu_graphics_speed
 * gpu_sm_speed
 * gpu_mem_speed
 * gpu_max_graphics_speed
 * gpu_max_sm_speed
 * gpu_max_mem_speed
 * gpu_temp
 * gpu_util
 * gpu_mem_util
 * gpu_mem_used
 * gpu_fan
 * gpu_power_usage
 * gpu_perf_state
 * gpu_ecc_mode

 As far as scalability is concerned, you should find that moving to sFlow as 
 the measurement transport reduces network traffic since all the metrics for 
 a node are transported in a single UDP datagram (rather than a datagram per 
 metric when using gmond as the agent). The other consideration is that sFlow 
 is unicast, so if you are using a multicast Ganglia setup then this involves 
 re-structuring your configuration.

Re: [Ganglia-general] Gmond Compilation on Cygwin

2012-07-10 Thread Robert Alexander
Hey Nigel,

I would be happy to help where I can.  I think Peter's approach is a good start.

We are updating the Ganglia plug-in with a few more metrics.  My dev branch on 
github has some updates not yet in the trunk.
https://github.com/ralexander/gmond_python_modules/tree/master/gpu/nvidia

In terms of metrics, I can help explain what each means.  I expect the 
usefulness of each to vary based on installation, so hopefully others can 
contribute their thoughts.

* gpu_num - Useful indirectly.
* gpu_driver - Useful when different machines may have different installed 
driver versions.

* gpu_type - Marketing name of the GPU.
* gpu_uuid - Globally unique immutable ID for the GPU chip.  This is NVIDIA's 
preferred identifier when software interfaces with a GPU.  On a multi-GPU board, 
each GPU has a unique UUID.
* gpu_pci_id - The GPU's PCI bus ID (how the GPU appears on the PCI bus).
+ gpu_serial - For Tesla GPUs there is a serial number printed on the board.  
Note that when there are multiple GPU chips on a single board, they share a 
common board serial number.  When a human needs to grab a particular board, 
this number works well.

* gpu_mem_total
* gpu_mem_used
Useful for high level application profiling.

* gpu_graphics_speed
+ gpu_max_graphics_speed
* gpu_sm_speed
+ gpu_max_sm_speed 
* gpu_mem_speed
+ gpu_max_mem_speed
These are various clock speeds.  Faster clocks generally mean higher performance.

* gpu_perf_state
Similar to CPU P-states.  P0 is the highest-performance state.  When the 
pstate is not P0, clock speeds and PCIe bandwidth can be reduced.

* gpu_util
* gpu_mem_util
The % of time the GPU SMs or GPU memory were busy over the last second.
This is a very coarse-grained way to monitor GPU usage.  For example, if 
only one SM is busy but it is busy for the entire second, then gpu_util = 100.
* gpu_fan
* gpu_temp
Some GPUs support these.  Useful to see how well the GPU is cooled.

* gpu_power_usage
+ gpu_power_man_mode
+ gpu_power_man_limit
GPU power draw.  Some GPUs support configurable power limits via power 
management mode.

* gpu_ecc_mode
Useful to ensure all GPUs are configured the same.  Describes if GPU 
memory error checking and correction is on or off.

If you are only concerned about coarse-grained GPU performance, then GPU 
performance state, utilization and % memory used may work well.
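
For illustration, a rough sketch of those coarse-grained checks through the 
pynvml bindings (function names as in nvidia-ml-py; verify against the NVML 
version installed, since some GPUs do not report power or fan data):

# Sketch: coarse-grained GPU health via pstate, utilization, memory and power.
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetPerformanceState, nvmlDeviceGetMemoryInfo,
                    nvmlDeviceGetUtilizationRates, nvmlDeviceGetPowerUsage)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)
    pstate = nvmlDeviceGetPerformanceState(handle)  # 0 == P0, fastest state
    util = nvmlDeviceGetUtilizationRates(handle)    # .gpu / .memory, in %
    mem = nvmlDeviceGetMemoryInfo(handle)           # .total / .used, in bytes
    power = nvmlDeviceGetPowerUsage(handle)         # milliwatts
    print('pstate=P%d util=%d%% mem_used=%.1f%% power=%.1fW'
          % (pstate, util.gpu, 100.0 * mem.used / mem.total, power / 1000.0))
finally:
    nvmlShutdown()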

Bernard, thanks for the heads up.

Hope that helps,
Robert Alexander
NVIDIA CUDA Tools Software Engineer

-Original Message-
From: Bernard Li [mailto:bern...@vanhpc.org] 
Sent: Tuesday, July 10, 2012 12:32 PM
To: Peter Phaal
Cc: Nigel LEACH; ganglia-general@lists.sourceforge.net; Robert Alexander
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Adding Robert Alexander to the list, since he and I worked together on the 
NVIDIA plug-in.

Thanks,

Bernard

On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal peter.ph...@gmail.com wrote:
 Nigel,

 A simple option would be to use Host sFlow agents to export the core 
 metrics from your Windows servers and use gmetric to add the GPU 
 metrics.

 You could combine code from the python GPU module and gmetric 
 implementations to produce a self-contained script for exporting GPU
 metrics:

 https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
 https://github.com/ganglia/ganglia_contrib

 Longer term, it would make sense to extend Host sFlow to use the 
 C-based NVML API to extract and export metrics. This would be 
 straightforward - the Host sFlow agent uses native C APIs on the 
 platforms it supports to extract metrics.

 What would take some thought is developing a standard set of summary 
 metrics to characterize GPU performance. Once the set of metrics is 
 agreed on, then adding them to the sFlow agent is pretty trivial.

 Currently the Ganglia python module exports the following metrics - 
 are they the right set? Anything missing? It would be great to get 
 involvement from the broader Ganglia community to capture best 
 practice from anyone running large GPU clusters, as well as getting 
 input from NVIDIA about the key metrics.

 * gpu_num
 * gpu_driver
 * gpu_type
 * gpu_uuid
 * gpu_pci_id
 * gpu_mem_total
 * gpu_graphics_speed
 * gpu_sm_speed
 * gpu_mem_speed
 * gpu_max_graphics_speed
 * gpu_max_sm_speed
 * gpu_max_mem_speed
 * gpu_temp
 * gpu_util
 * gpu_mem_util
 * gpu_mem_used
 * gpu_fan
 * gpu_power_usage
 * gpu_perf_state
 * gpu_ecc_mode

 As far as scalability is concerned, you should find that moving to 
 sFlow as the measurement transport reduces network traffic since all 
 the metrics for a node are transported in a single UDP datagram 
 (rather than a datagram per metric when using gmond as the agent). The 
 other consideration is that sFlow is unicast, so if you are using a 
 multicast Ganglia setup then this involves re-structuring your 
 configuration.

 You still need to have at least one gmond instance, but it acts as an 
 sFlow aggregator and is mute:
 http