Re: [Ganglia-general] Insane negative values for cpu_idle and cpu_wio when node is CPU bound
This appears to be an issue with Mr. Perzl's updated libperfstat code, which is borrowed from IBM's perfstat_cpu_total example:

ftp://www.oss4aix.org/ganglia/RPMs-3.3.7/src/ganglia-3.3.7-aix.patch
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftools/doc/prftools/idprftools_perfstat_glob_cpu.htm

When calculating wio and idle, the code performs a division with (dlt_lcpu_wait + dlt_lcpu_idle) as the divisor. If the server is CPU bound, i.e. usr+sys = 100%, then both dlt_lcpu_wait and dlt_lcpu_idle will be zero, and the division will occur with zero as the divisor. (That would also explain garbage values rather than a crash: as far as I know, integer division by zero does not trap on POWER; it simply yields an undefined result.) This should be a fairly simple fix, and I am attempting to contact Mr. Perzl to that effect.

-----Original Message-----
From: Khrist Hansen [mailto:khrist.han...@gmail.com]
Sent: Wednesday, October 02, 2013 6:18 PM
To: ganglia-general@lists.sourceforge.net
Subject: RE: Insane negative values for cpu_idle and cpu_wio when node is CPU bound

[snip: quoted messages appear in full below]
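A minimal sketch of the kind of guard described above, assuming the patch really does divide by (dlt_lcpu_wait + dlt_lcpu_idle); the variable names follow the patch's dlt_lcpu_* convention, and the actual formula there may differ:

    /* Sketch only, not the actual patch code: return 0 when the
     * wait+idle delta is zero instead of dividing by it. */
    #include <stdio.h>

    typedef unsigned long long ticks_t;

    static double wio_pct(ticks_t dlt_lcpu_wait, ticks_t dlt_lcpu_idle)
    {
        ticks_t divisor = dlt_lcpu_wait + dlt_lcpu_idle;
        if (divisor == 0)       /* usr+sys == 100%: nothing waited or idled */
            return 0.0;
        return 100.0 * (double)dlt_lcpu_wait / (double)divisor;
    }

    int main(void)
    {
        printf("%.1f\n", wio_pct(0, 0));    /* CPU bound -> 0.0, not garbage */
        printf("%.1f\n", wio_pct(25, 75));  /* -> 25.0 */
        return 0;
    }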
[Ganglia-general] Insane negative values for cpu_idle and cpu_wio when node is CPU bound
Environment: AIX 6.1 TL7 SP7, gmond 3.6.0 (from http://www.perzl.org/ganglia/)

I noticed that a particular node would send insanely large negative values for the cpu_idle and cpu_wait metrics when cpu_user + cpu_system were near 100%, i.e. the node is completely CPU bound. The result is major skewing of the node's cpu_idle and cpu_wio graphs, so that no true positive values are visible, and the cpu_report graphs for the node, cluster, and grid become corrupted. Here is an example of what I am talking about: http://imgur.com/a/aIzyU

I am able to replicate this behavior on any AIX node by running the following command to generate CPU load:

    perl -e 'while (--$ARGV[0] and fork) {}; while (1) {}' 8

where the last argument is the number of threads available to the server. For example, if a server has 2 POWER7 vCPUs, then it has 8 threads (logical CPUs) due to 4-way simultaneous multithreading (SMT).

Has anyone else experienced this on AIX or Linux?

Thanks!
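For anyone who wants to look at the raw counters those percentages are derived from, below is a minimal reader built on the same perfstat_cpu_total() call the patched code is based on. AIX only, compiled with something like "cc -o cputicks cputicks.c -lperfstat"; the program name and output format are mine, not Ganglia's:

    /* Dump the logical CPU count and raw CPU tick counters from
     * libperfstat (the same source gmond's AIX metrics come from). */
    #include <stdio.h>
    #include <libperfstat.h>

    int main(void)
    {
        perfstat_cpu_total_t c;

        if (perfstat_cpu_total(NULL, &c, sizeof(c), 1) < 1) {
            perror("perfstat_cpu_total");
            return 1;
        }
        /* ncpus = active logical CPUs, e.g. 8 for 2 vCPUs at SMT-4 */
        printf("logical CPUs: %d\n", c.ncpus);
        printf("user=%llu sys=%llu idle=%llu wait=%llu\n",
               c.user, c.sys, c.idle, c.wait);
        return 0;
    }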
Re: [Ganglia-general] Insane negative values for cpu_idle and cpu_wio when node is CPU bound
Here is another example from gstat:

     CPUs (Procs/Total) [     1,     5, 15min] [ User, Nice, System,           Idle,            Wio]
        8 (   8/  122)  [  4.59,  2.04,  1.35] [ 99.8,  0.0,    0.2, -67062349824.0, -67062349824.0] OFF

Looking at the source code for the AIX metrics
(https://github.com/ganglia/monitor-core/blob/master/libmetrics/aix/metrics.c),
it appears that negative values should be converted to 0. Either this is not
happening, or the metrics are somehow being modified after the fact.

    g_val_t
    cpu_wio_func ( void )
    {
       g_val_t val;

       get_cpuinfo();
       val.f = CALC_CPUINFO(wait);

       if (val.f < 0) val.f = 0.0;
       return val;
    }

    g_val_t
    cpu_idle_func ( void )
    {
       g_val_t val;

       get_cpuinfo();
       val.f = CALC_CPUINFO(idle);

       if (val.f < 0) val.f = 0.0;
       return val;
    }

-----Original Message-----
From: K. Hansen
Sent: Wednesday, October 02, 2013 4:50 PM
To: ganglia-general@lists.sourceforge.net
Subject: Insane negative values for cpu_idle and cpu_wio when node is CPU bound

[snip: quoted original message appears in full above]
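For what it is worth, a clamp like that has at least one blind spot (whether it is what is happening here, I cannot say): if the percentage is ever computed as a floating-point 0/0, the result is NaN, and NaN compares false against everything, so "if (val.f < 0)" never fires. A small illustration:

    /* Illustration only: a clamp-negatives-to-zero guard does not
     * catch NaN, because ordered comparisons with NaN are false. */
    #include <stdio.h>

    int main(void)
    {
        double zero = 0.0;                 /* runtime value, avoids warnings */
        double pct = 100.0 * zero / zero;  /* 0.0/0.0 -> NaN */

        if (pct < 0)                       /* false for NaN; clamp skipped */
            pct = 0.0;

        printf("pct = %f\n", pct);         /* prints nan */
        return 0;
    }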