Hello Kostas,

Here is a theory about the cause, and a patch that fixes it (which I'm submitting in a follow-up message).

Ganglia uses floating-point variables (double) to store the jiffies and the jiffy sums. Looking at how procps [1] handles /proc/stat, I see it uses integers (unsigned long long) to store those same values, and uses floating point only for the percentages.
The theory is: floating-point numbers have limited precision, and Ganglia loses precision in its calculations with the large jiffy counts seen on ppc64.
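
To illustrate the general precision limit behind this theory, here is a minimal C sketch. The helper names survives_double and survives_float are made up for this example and are not part of Ganglia or procps. It only shows that large integer counters cannot always be stored exactly in floating-point types; it does not show that the counters on the affected machine actually reach these ranges.

```c
/* Hypothetical helpers, not Ganglia code: check whether an integer counter
 * survives a round trip through a floating-point type unchanged. */

/* Returns 1 if v is exactly representable as a double, 0 otherwise. */
static int survives_double(unsigned long long v)
{
    volatile double d = (double)v;   /* volatile forces rounding to double */
    return (unsigned long long)d == v;
}

/* Returns 1 if v is exactly representable as a float, 0 otherwise. */
static int survives_float(unsigned long long v)
{
    volatile float f = (float)v;     /* volatile forces rounding to float */
    return (unsigned long long)f == v;
}
```

A double has a 53-bit significand, so 2^53 + 1 (9007199254740993) is rounded when stored; a single-precision float has 24 bits, so the same happens from 2^24 + 1 (16777217). When two large, rounded counters are subtracted, the rounding error can be as large as the small difference being measured.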

I've patched Ganglia's libmetrics/linux/metrics.c to use integers (unsigned long long) to store the jiffies and jiffy sums, and floating point (double) to store only the calculated values.
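
The approach described above can be sketched roughly as follows. This is not the actual patch: struct cpu_jiffies, parse_cpu_line and idle_percent are hypothetical names, and only the first four fields of the aggregate "cpu" line in /proc/stat (user, nice, system, idle) are considered, which is a simplification.

```c
#include <stdio.h>

/* Hypothetical snapshot of the first four counters of the "cpu" line. */
struct cpu_jiffies {
    unsigned long long user, nice, system, idle;
};

/* Parse a line such as "cpu  100 0 100 800 ..." into integer counters.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_cpu_line(const char *line, struct cpu_jiffies *out)
{
    return sscanf(line, "cpu %llu %llu %llu %llu",
                  &out->user, &out->nice, &out->system, &out->idle) == 4;
}

/* Sum the counters as integers, so no precision is lost. */
static unsigned long long total_jiffies(const struct cpu_jiffies *s)
{
    return s->user + s->nice + s->system + s->idle;
}

/* Idle percentage between two snapshots. The differences are taken on
 * integers; floating point is used only for the final ratio. */
static double idle_percent(const struct cpu_jiffies *prev,
                           const struct cpu_jiffies *cur)
{
    unsigned long long prev_total = total_jiffies(prev);
    unsigned long long cur_total = total_jiffies(cur);

    /* Guard against no elapsed time or counters going backwards. */
    if (cur_total <= prev_total || cur->idle < prev->idle)
        return 0.0;

    return 100.0 * (double)(cur->idle - prev->idle)
                 / (double)(cur_total - prev_total);
}
```

Because the subtraction happens before any conversion to double, the result depends only on the small differences between snapshots, not on the absolute size of the counters.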

I've reproduced the problem on a test machine, and tested the fix successfully. [2]

Please let me know if you have any questions.

Regards,
xavier


1: procps is the package that provides a set of small utilities that report information about processes using the /proc filesystem

2:
Before patching:
# nc localhost 9001 | grep cpu_idle
<METRIC NAME="cpu_idle" VAL="27251.1" TYPE="float" UNITS="%" TN="0" TMAX="90" DMAX="0" SLOPE="both">

After patching:
# nc localhost 9001 | grep cpu_idle
<METRIC NAME="cpu_idle" VAL="99.7" TYPE="float" UNITS="%" TN="0" TMAX="90" DMAX="0" SLOPE="both">

After reverting the patch:
# nc localhost 9001 | grep cpu_idle
<METRIC NAME="cpu_idle" VAL="29738.0" TYPE="float" UNITS="%" TN="21" TMAX="90" DMAX="0" SLOPE="both">





Rafael Xavier de Souza
Linux Technology Center Software Engineer
IBM Systems & Technology Group
[email protected]
MM17 Hortolândia-SP, Brazil

On 30-04-2010 09:21, Kostas Georgiou wrote:
> On Wed, Apr 28, 2010 at 03:04:51PM -0300, Rafael Xavier de Souza wrote:
>
>> The system/idle cpu metrics go crazy on some of my ppc64 boxes.
>> I was wondering if any of you have seen this already or have any
>> idea of what could be going on?
>>
>> I'm guessing it has something to do with parsing /proc/stat, since
>> Ganglia retrieves cpu info from it. I'm attaching a snapshot of
>> /proc/stat from the crazy machine in case it helps.
>>
>> Any ideas?
>
> I think that the code has a limit of 32 cpus in the /proc/stat parsing,
> no idea if this is the cause.
>
> Kostas

------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
