[Linux-HA] Antw: Re: pacemaker/HealthCPU
Lars, you are right, and I saw that my guess to use /proc/stat was wrong. top is slow in getting the current CPU usage. So basically I wondered if you need the CPU usage at all. If you'd switch to load, you could get it a lot faster.

To be honest: I wondered what HealthCPU would monitor about the CPU's health when initially looking into it. I was kind of disappointed to see that it simply inspects the CPU usage (a CPU that is 100% busy (0% idle) may be quite healthy) ;-)

Regards,
Ulrich

Lars Ellenberg lars.ellenb...@linbit.com wrote on 04.02.2011 at 13:26 in message 20110204122642.GG10069@barkeeper1-xen.linbit:
> On Thu, Feb 03, 2011 at 01:09:04PM +0100, Michael Schwartzkopff wrote:
> > On Thursday 03 February 2011 12:35:34 Ulrich Windl wrote:
> > > Hi!
> > >
> > > I'm starting to explore Linux-HA. Examining one of the monitors, I think things could be made much more efficient. For example: To get the percent of idle CPU the monitor uses 4 processes:
> > >
> > >   top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }'
> > >
> > > However awk can do the effect of grep and tail as well. My first attempt is this:
> > >
> > >   top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }'
> > >
> > > My second attempt uses /proc/stat instead, avoiding the slow top process:
> > >
> > >   awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > >
> > >   time (top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }')
> > >   awk: warning: escape sequence `\.' treated as plain `.'
> > >   99
> > >
> > >   real 0m3.533s
> > >   user 0m0.008s
> > >   sys  0m0.008s
> > >
> > >   time (top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }')
> > >   awk: warning: escape sequence `\.' treated as plain `.'
> > >   99
>
> Ouch. Big FAIL here already ;-)
>
> > >   real 0m0.518s
> > >   user 0m0.000s
> > >   sys  0m0.008s
> > >
> > >   time awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > >   98
>
> And you actually believe that this was the equivalent of the above top | etc pipe? See below.
>
> > >   real 0m0.004s
> > >   user 0m0.000s
> > >   sys  0m0.000s
> > >
> > > Regards,
> > > Ulrich
> >
> > Hi,
> >
> > good idea. The only problem is that the information in /proc/stat is in ticks and does not give you an absolute value. So you would have to calculate the difference yourself, which makes the task much more difficult.
>
> 1) /proc/stat is Linux specific.
> 2) /proc/stat is what top samples on Linux ;-)
> 3) It is in USER_HZ, so it's in centiseconds, which makes it easy enough to calculate a meaningful difference. Besides, as long as it is _any_ consistent unit, the unit does not matter, as it is in both numerator and denominator ;-)
> 4) time top -b -n2 | pipe vs. time cat /proc/stat ...
>
> You do realize that to get any meaningful measure about current cpu usage, while the only measure readily available is cpu usage since system boot, you need to watch it for a while?
>
>   $ strace -tt -T -e read,select top -b -n2 2>&1 1>/dev/null | grep -Ee ' read[(][0-9]*, "cpu | select[(]'
>   12:27:55.985830 read(3, "cpu 2146371 14 1345585 66007476"..., 8192) = 1586 <0.000395>
>   12:27:56.065141 select(0, NULL, NULL, NULL, {0, 50}) = 0 (Timeout) <0.500579>
>   12:27:56.644309 read(5, "cpu 2146377 14 1345595 66007497"..., 1024) = 1024 <0.000164>
>   12:27:56.645109 select(0, NULL, NULL, NULL, {3, 0}) = 0 (Timeout) <3.002688>
>   12:27:59.730617 read(5, "cpu 2146379 14 1345604 66007624"..., 1024) = 1024 <0.000164>
>
>   [lars@soda:~/DRBD/drbd-8.3]$ strace -tt -T -e read,select top -b -n2 -d 0.1 2>&1 1>/dev/null | grep -Ee ' read[(][0-9]*, "cpu | select[(]'
>   12:28:14.834497 read(3, "cpu 2146386 14 1345657 66008224"..., 8192) = 1586 <0.000395>
>   12:28:14.919021 select(0, NULL, NULL, NULL, {0, 50}) = 0 (Timeout) <0.500582>
>   12:28:15.501349 read(5, "cpu 2146391 14 1345669 66008247"..., 1024) = 1024 <0.000165>
>   12:28:15.502149 select(0, NULL, NULL, NULL, {0, 10}) = 0 (Timeout) <0.100165>
>   12:28:15.681910 read(5, "cpu 2146394 14 1345674 66008252"..., 1024) = 1024 <0.000162>
>
> Notice something? (Mind the timestamps, and the -d option...)
>
> The default delay of top on my setup seems to be 3 seconds. If you want the same accuracy, then any replacement would need to sleep the same three seconds between two samplings of /proc/stat.
>
> So what you measure with your first variant (the print $4; exit) is the first estimate of top, after sampling /proc/stat twice with a hardcoded delay of 0.5 seconds, not waiting for the better, 3-second sampling period. While this may be good enough, it is certainly not equivalent. You could drop the exit from awk, and do top -b -n1 to achieve the same.
>
> And exactly what do you think you measure with
>
>   awk '$1 == "cpu" { print $7; exit }' /proc/stat
>
> ? Column 7 is cumulative centiseconds spent in (hard)irq since system boot. What would the HealthCPU agent do with that?
>
> What you would need to do is sample /proc/stat twice, with a delay of, say, 1 to 3 seconds, calculate the relative cpu time spent in idle, and get
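
Lars's closing suggestion (sample /proc/stat twice, then take the idle share of the difference) could be sketched roughly as below. This is only an illustration under assumptions, not the HealthCPU agent's code: the function names idle_pct and sample_idle are made up here, the field layout (user nice system idle iowait irq softirq steal) is the Linux one, and only fields up to steal are summed so that guest time is not counted twice.

```shell
# Print the idle percentage between two "cpu" summary lines from /proc/stat.
# $1 = earlier sample line, $2 = later sample line (hypothetical helper).
idle_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        $1 == "cpu" {
            # Sum user..steal (fields 2-9); older kernels have fewer fields.
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            if (seen && total > prev_total) {
                # Idle is field 5 once the "cpu" label counts as field 1.
                printf "%.1f\n", 100 * ($5 - prev_idle) / (total - prev_total)
            }
            prev_total = total; prev_idle = $5; seen = 1
        }'
}

# Take two samples DELAY seconds apart (default 3, like top on Lars's setup).
sample_idle() {
    delay=${1:-3}
    first=$(grep '^cpu ' /proc/stat)
    sleep "$delay"
    second=$(grep '^cpu ' /proc/stat)
    idle_pct "$first" "$second"
}
```

Calling sample_idle 1 would block for one second and print a single number. Because both samples use the same unit, the USER_HZ value cancels out, as point 3 above notes.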
Re: [Linux-HA] Antw: Re: pacemaker/HealthCPU
On Friday 04 February 2011 14:35:35 Ulrich Windl wrote:
> Lars, you are right, and I saw that my guess to use /proc/stat was wrong. top is slow in getting the current CPU usage. So basically I wondered if you need the CPU usage at all. If you'd switch to load, you could get it a lot faster.
>
> To be honest: I wondered what HealthCPU would monitor about the CPU's health when initially looking into it. I was kind of disappointed to see that it simply inspects the CPU usage (a CPU that is 100% busy (0% idle) may be quite healthy) ;-)
>
> Regards,
> Ulrich

Please consider the HealthCPU RA as a first try. Useful patches are always welcome.

Greetings,

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München
Tel: (0163) 172 50 98

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
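
For comparison, Ulrich's "switch to load" idea might look something like the sketch below. It assumes the Linux /proc/loadavg format (the 1-minute average is the first field); load1 and load_above are hypothetical helper names, and nothing here is taken from the HealthCPU agent. Reading the load needs no sampling delay, but it answers a different question than idle percentage.

```shell
# Print the 1-minute load average from a file in /proc/loadavg format.
# $1 = file to read (defaults to /proc/loadavg; Linux only).
load1() {
    awk '{ print $1; exit }' "${1:-/proc/loadavg}"
}

# Succeed (exit status 0) if the 1-minute load exceeds the limit in $1.
# $2 = file to read (defaults to /proc/loadavg).
load_above() {
    awk -v limit="$1" '{ exit !($1 > limit) }' "${2:-/proc/loadavg}"
}
```

A monitor could then do something like: if load_above 8; then ... fi, where the threshold 8 is purely an example value.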
Re: [Linux-HA] Antw: Re: pacemaker/HealthCPU
Lars Ellenberg wrote:
> On Fri, Feb 04, 2011 at 02:35:35PM +0100, Ulrich Windl wrote:
> > Lars, you are right, and I saw that my guess to use /proc/stat was wrong. top is slow in getting the current CPU usage. So basically I wondered if you need the CPU usage at all. If you'd switch to load, you could get it a lot faster.
> >
> > To be honest: I wondered what HealthCPU would monitor about the CPU's health when initially looking into it. I was kind of disappointed to see that it simply inspects the CPU usage (a CPU that is 100% busy (0% idle) may be quite healthy) ;-)
>
> "Health" was arguably a bad choice, "Utilization" may have been more appropriate. But try to define cpu health...

Of course, with 4+-core CPUs, you'd very rarely see all of them at 100% busy. Especially when it only takes one to saturate your I/O bus.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
[Linux-HA] Antw: Re: pacemaker/HealthCPU
Michael Schwartzkopff mi...@clusterbau.com wrote on 03.02.2011 at 13:09 in message 201102031309.04931.mi...@clusterbau.com:
> On Thursday 03 February 2011 12:35:34 Ulrich Windl wrote:
> > Hi!
> >
> > I'm starting to explore Linux-HA. Examining one of the monitors, I think things could be made much more efficient. For example: To get the percent of idle CPU the monitor uses 4 processes:
> >
> >   top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }'
> >
> > However awk can do the effect of grep and tail as well. My first attempt is this:
> >
> >   top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }'
> >
> > My second attempt uses /proc/stat instead, avoiding the slow top process:
> >
> >   awk '$1 == "cpu" { print $7; exit }' /proc/stat
> >
> >   time (top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }')
> >   awk: warning: escape sequence `\.' treated as plain `.'
> >   99
> >
> >   real 0m3.533s
> >   user 0m0.008s
> >   sys  0m0.008s
> >
> >   time (top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }')
> >   awk: warning: escape sequence `\.' treated as plain `.'
> >   99
> >
> >   real 0m0.518s
> >   user 0m0.000s
> >   sys  0m0.008s
> >
> >   time awk '$1 == "cpu" { print $7; exit }' /proc/stat
> >   98
> >
> >   real 0m0.004s
> >   user 0m0.000s
> >   sys  0m0.000s
> >
> > Regards,
> > Ulrich
>
> Hi,
>
> good idea. The only problem is that the information in /proc/stat is in ticks and does not give you an absolute value. So you would have to calculate the difference yourself, which makes the task much more difficult.

OK, what about this:

  time (procinfo | awk '$1 == "idle" && $2 == ":" { if (sub("%", "", $5)) { print $5 } else { sub("%", "", $4); print $4 } }')
  99.4

  real 0m0.010s
  user 0m0.000s
  sys  0m0.000s

  # Maybe use "print int($x)" to see an integer

(I didn't know the details of /proc/stat. Those who want to might read: /usr/src/linux/kernel/sched.c, /usr/src/linux/include/linux/kernel_stat.h, /usr/src/linux/include/asm-generic/cputime.h)

Regards,
Ulrich

P.S.: The lines that are processed look like this:

  system:    1:21:29.98   0.6%  page act:2141840
  IOwait:    3:58:58.89   1.9%  page dea:1635314
  hw irq:    0:02:17.38   0.0%  page flt: 766785934
  sw irq:    0:13:44.90   0.1%  swap in :562
  idle  : 7d 21:37:23.52 90.4%  swap out:899
  uptime: 2d 4:25:34.97         context : 119011606
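
To see why the awk filter above tries field 5 first and then falls back to field 4, here is a small sketch that wraps it in a function reading procinfo-style text from standard input. The name procinfo_idle is made up for illustration. The day count ("7d") appears only after roughly a day of accumulated idle time, which shifts the percentage by one field.

```shell
# Extract the idle percentage from procinfo-style output on stdin.
# procinfo_idle is a hypothetical wrapper around the awk filter from the post.
procinfo_idle() {
    awk '$1 == "idle" && $2 == ":" {
        # With a day count ("7d") the percentage is field 5;
        # without it, sub() finds no "%" in field 5 and field 4 is used.
        if (sub("%", "", $5)) { print $5 } else { sub("%", "", $4); print $4 }
    }'
}
```

On a node with procinfo installed this would be used as: procinfo | procinfo_idle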
Re: [Linux-HA] Antw: Re: pacemaker/HealthCPU
On Thursday 03 February 2011 16:00:54 Ulrich Windl wrote:
> Michael Schwartzkopff mi...@clusterbau.com wrote on 03.02.2011 at 13:09 in message 201102031309.04931.mi...@clusterbau.com:
> > On Thursday 03 February 2011 12:35:34 Ulrich Windl wrote:
> > > Hi!
> > >
> > > I'm starting to explore Linux-HA. Examining one of the monitors, I think things could be made much more efficient. For example: To get the percent of idle CPU the monitor uses 4 processes:
> > >
> > >   top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }'
> > >
> > > However awk can do the effect of grep and tail as well. My first attempt is this:
> > >
> > >   top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }'
> > >
> > > My second attempt uses /proc/stat instead, avoiding the slow top process:
> > >
> > >   awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > >
> > >   time (top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }')
> > >   awk: warning: escape sequence `\.' treated as plain `.'
> > >   99
> > >
> > >   real 0m3.533s
> > >   user 0m0.008s
> > >   sys  0m0.008s
> > >
> > >   time (top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }')
> > >   awk: warning: escape sequence `\.' treated as plain `.'
> > >   99
> > >
> > >   real 0m0.518s
> > >   user 0m0.000s
> > >   sys  0m0.008s
> > >
> > >   time awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > >   98
> > >
> > >   real 0m0.004s
> > >   user 0m0.000s
> > >   sys  0m0.000s
> > >
> > > Regards,
> > > Ulrich
> >
> > Hi,
> >
> > good idea. The only problem is that the information in /proc/stat is in ticks and does not give you an absolute value. So you would have to calculate the difference yourself, which makes the task much more difficult.
>
> OK, what about this:
>
>   time (procinfo | awk '$1 == "idle" && $2 == ":" { if (sub("%", "", $5)) { print $5 } else { sub("%", "", $4); print $4 } }')
>   99.4
>
>   real 0m0.010s
>   user 0m0.000s
>   sys  0m0.000s
>
>   # Maybe use "print int($x)" to see an integer
>
> (I didn't know the details of /proc/stat. Those who want to might read: /usr/src/linux/kernel/sched.c, /usr/src/linux/include/linux/kernel_stat.h, /usr/src/linux/include/asm-generic/cputime.h)
>
> Regards,
> Ulrich
>
> P.S.: The lines that are processed look like this:
>
>   system:    1:21:29.98   0.6%  page act:2141840
>   IOwait:    3:58:58.89   1.9%  page dea:1635314
>   hw irq:    0:02:17.38   0.0%  page flt: 766785934
>   sw irq:    0:13:44.90   0.1%  swap in :562
>   idle  : 7d 21:37:23.52 90.4%  swap out:899
>   uptime: 2d 4:25:34.97         context : 119011606

OK. That looks better. Here I see the problem that the procinfo package might not be installed on all cluster nodes, and thus the resource-agents package would have to be made dependent on this package.

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München
Tel: (0163) 172 50 98
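
One way around a hard package dependency would be to probe for procinfo at run time and fall back to something else when it is missing. The sketch below only illustrates that idea: have_cmd and pick_idle_source are hypothetical names, the fallback order is an assumption, and the strings printed are just labels for whichever measurement routine the agent would then call.

```shell
# Succeed if the named command can be found on this node.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print a label for the first usable data source (assumed preference order).
pick_idle_source() {
    if have_cmd procinfo; then
        echo procinfo
    elif [ -r /proc/stat ]; then
        echo proc_stat
    else
        echo top
    fi
}
```

With this, the package could list procinfo as optional rather than required, at the cost of nodes possibly measuring in different ways.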
[Linux-HA] Antw: Re: pacemaker/HealthCPU
Soffen, Matthew msof...@iso-ne.com wrote on 03.02.2011 at 16:35 in message e847bfef193361409d48010ec8ace3bc01703...@exchangebe.iso-ne.com:
> Morning All,
>
> Please also keep in mind that /proc/stat is ONLY in Linux, and Linux-HA, despite the name, is also used on FreeBSD and Solaris.

Hi!

Good thought: I was wondering whether the output format on other systems matches that of Linux. Anyway, the other solutions are much faster than the original. Maybe a $(uname) will help.

For example, a Perl script that is supposed to get the list of running processes on HP-UX and Linux (SLES10 SP3) uses a hash table like this:

  use constant PS_CONF => {
      # PID TTY TIME COMMAND
      # 6714 ? 223:15 dw.sapC11_DVEBMGS00 pf=/usr/sap/C11/... bla bla
      'hpux' => ['ps -ex', qr/^\s*(\d+)\s+\S+\s+\d+:\d+\s+(\S+)(.*)$/, qr/^init$/],
      # PID TTY STAT TIME COMMAND
      # 15046 ? S 110:29 dw.sapC11_D09 pf=/usr/sap/C11/... bla bla
      'linux' => ['ps ax', qr/^\s*(\d+)\s+\S+\s+\S+\s+\d+:\d+\s+(\S+)(.*)$/, qr/^init$/],
  }; # OS($^O)-dependent ps configuration
  my $ps_conf = PS_CONF->{$^O};

Regards,
Ulrich
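
Since the resource agents themselves are shell scripts, the same per-OS lookup could be written with a case statement on the uname output. This is a rough sketch only: ps_cmd_for is a made-up name, the two entries simply mirror the Perl table above, and the strings "Linux" and "HP-UX" are what uname -s is expected to print on those systems.

```shell
# Print the ps invocation for the OS name given in $1 (as from "uname -s").
# Returns non-zero for systems that have no entry yet.
ps_cmd_for() {
    case "$1" in
        Linux) echo "ps ax" ;;
        HP-UX) echo "ps -ex" ;;
        *)     return 1 ;;
    esac
}
```

An agent would call it as ps_cmd_for "$(uname -s)" and treat a failure as "unsupported platform" instead of guessing at an output format.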