Re: [Ganglia-general] Insane negative values for cpu_idle and cpu_wio when node is CPU bound

2013-10-03 Thread Khrist Hansen
This appears to be an issue with Mr. Perzl's updated libperfstat code
borrowed from IBM's perfstat_cpu_total example.

ftp://www.oss4aix.org/ganglia/RPMs-3.3.7/src/ganglia-3.3.7-aix.patch

http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftools/doc/prftools/idprftools_perfstat_glob_cpu.htm

When calculating wio and idle, the code divides by (dlt_lcpu_wait +
dlt_lcpu_idle).  If the server is CPU bound, i.e. usr + sys = 100%, then
both dlt_lcpu_wait and dlt_lcpu_idle are zero, and the code divides by
zero.
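
For illustration, here is a minimal sketch of the kind of guard I have in
mind, using the delta names from the IBM example (the numerator math and
the surrounding declarations are my assumptions, not the actual patch
code):

   /* Sketch only: dlt_lcpu_wait and dlt_lcpu_idle are the deltas named
      in IBM's perfstat_cpu_total example; the percentage math is an
      assumption. */
   u_longlong_t denom = dlt_lcpu_wait + dlt_lcpu_idle;
   double wio, idle;

   if (denom > 0) {
      wio  = 100.0 * (double)dlt_lcpu_wait / (double)denom;
      idle = 100.0 * (double)dlt_lcpu_idle / (double)denom;
   } else {
      wio  = 0.0;   /* CPU bound: usr + sys is 100%, nothing to apportion */
      idle = 0.0;
   }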

This should be a fairly simple fix (guard against the zero divisor), and I
am attempting to contact Mr. Perzl about it.


-----Original Message-----
From: Khrist Hansen [mailto:khrist.han...@gmail.com] 
Sent: Wednesday, October 02, 2013 6:18 PM
To: ganglia-general@lists.sourceforge.net
Subject: RE: Insane negative values for cpu_idle and cpu_wio when node is
CPU bound

Here is another example from gstat:

 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System,           Idle,            Wio]
    8 (    8/  122) [  4.59,  2.04,  1.35] [  99.8,   0.0,    0.2, -67062349824.0, -67062349824.0] OFF

Looking at the source code for the AIX metrics
(https://github.com/ganglia/monitor-core/blob/master/libmetrics/aix/metrics.c),
it appears that negative values should be converted to 0.  This is either
not happening or the metrics are somehow being modified after the fact.

g_val_t
cpu_wio_func ( void )
{
   g_val_t val;

   get_cpuinfo();
   val.f = CALC_CPUINFO(wait);

   if (val.f < 0) val.f = 0.0;
   return val;
}

g_val_t
cpu_idle_func ( void )
{
   g_val_t val;

   get_cpuinfo();
   val.f = CALC_CPUINFO(idle);

   if (val.f < 0) val.f = 0.0;
   return val;
}
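
Note that even with that clamp in place, a zero divisor upstream can
produce a value the clamp cannot catch: in IEEE 754 arithmetic, 0.0/0.0
yields NaN, and NaN compares false against everything, so
"if (val.f < 0)" leaves it untouched.  A small standalone snippet (mine,
not from the Ganglia source) demonstrating that failure mode:

   #include <math.h>
   #include <stdio.h>

   int main(void)
   {
      float wait = 0.0f, idle = 0.0f;             /* CPU bound: both deltas zero */
      float val  = 100.0f * wait / (wait + idle); /* 0.0f / 0.0f yields NaN */

      if (val < 0) val = 0.0f;                    /* false for NaN; clamp never fires */

      printf("val = %f, isnan = %d\n", val, isnan(val));
      return 0;
   }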


From: K. Hansen
Sent: Wednesday, October 02, 2013 4:50 PM
To: ganglia-general@lists.sourceforge.net
Subject: Insane negative values for cpu_idle and cpu_wio when node is CPU
bound

Environment:
AIX 6.1 TL7 SP7
gmond 3.6.0 (from http://www.perzl.org/ganglia/)

I noticed that a particular node would send huge negative values for the
cpu_idle and cpu_wio metrics when cpu_user + cpu_system was near 100%,
i.e. the node was completely CPU bound.  The result is major skewing of the
node's cpu_idle and cpu_wio graphs, so that no true positive values are
visible, and the cpu_report graphs for the node, cluster, and grid become
corrupted.

Here is an example of what I am talking about:  http://imgur.com/a/aIzyU

I am able to replicate this behavior on any AIX node by running the
following command to generate CPU load:

perl -e 'while (--$ARGV[0] and fork) {}; while (1) {}' 8

The final argument is the number of hardware threads (logical CPUs)
available to the server; the one-liner forks until its counter reaches
zero, leaving one busy-looping process per logical CPU.  For example, if a
server has 2 POWER7 vCPUs, then it has 8 threads (logical CPUs) due to
4-way simultaneous multithreading (SMT).

Has anyone else experienced this on AIX or Linux?

Thanks!





