[Linux-HA] Antw: Re: pacemaker/HealthCPU

Ulrich Windl Fri, 04 Feb 2011 05:36:11 -0800

Lars,

you are right, and I see that my guess to use /proc/stat was wrong. top is slow 
in getting the current CPU usage. So basically I wondered whether you need the 
CPU usage at all. If you switched to "load", you could get it a lot faster.
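
As a rough illustration of why "load" is cheap to obtain: assuming "load" here means the load average and that the node exposes the Linux-style /proc/loadavg file, a single read is enough and no sampling delay is needed. The function name `load_1min` and its optional file argument are made up for this sketch; they are not part of the HealthCPU agent.

```shell
#!/bin/bash
# Print the 1-minute load average.
# Assumption: Linux-style file layout "1min 5min 15min running/total lastpid".
# The optional file argument is hypothetical and only makes testing easier.
load_1min()
{
    local file=${1:-/proc/loadavg}
    local one rest
    read -r one rest < "$file"
    [ -n "$one" ] || return 1
    printf '%s\n' "$one"
}

# Only report when the file is actually there (it is Linux-specific).
if [ -r /proc/loadavg ]; then
    echo "1-minute load: $(load_1min)"
fi
```

Note that load is not a percentage, so a threshold would probably have to be chosen relative to the number of CPUs in the node.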


To be honest: when I first looked into it, I wondered what "HealthCPU" would 
monitor about the CPU's "health". I was kind of disappointed to see that it 
simply inspects the CPU usage; a CPU that is 100% busy (0% idle) may be quite 
healthy ;-)

Regards,
Ulrich

>>> Lars Ellenberg <[email protected]> wrote on 04.02.2011 at 13:26 in
message <[email protected]>:
> On Thu, Feb 03, 2011 at 01:09:04PM +0100, Michael Schwartzkopff wrote:
> > On Thursday 03 February 2011 12:35:34 Ulrich Windl wrote:
> > > Hi!
> > > 
> > > I'm starting to explore Linux-HA. Examining one of the monitors, I think
> > > things could be made much more efficient. For example: To get the percent
> > > of idle CPU the monitor uses 4 processes: top -b -n2 | grep Cpu | tail -1
> > > | awk -F",|\.[0-9]%id" '{ print $4 }'
> > > 
> > > However awk can do the job of grep and tail as well. My first attempt is
> > > this: top -b -n2 | awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }'
> > > 
> > > My second attempt uses /proc/stat instead, avoiding the slow top process:
> > > awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > > 
> > > time (top -b -n2 | grep Cpu | tail -1 | awk -F",|\.[0-9]%id" '{ print $4 }')
> > > awk: warning: escape sequence `\.' treated as plain `.'
> > >  99
> > > 
> > > real    0m3.533s
> > > user    0m0.008s
> > > sys     0m0.008s
> > > 
> > >  time (top -b -n2| awk -F",|\.[0-9]%id" '/^Cpu/{ print $4; exit }')
> > > awk: warning: escape sequence `\.' treated as plain `.'
> > >  99
> 
> Ouch. Big FAIL here already ;-)
> 
> > > 
> > > real    0m0.518s
> > > user    0m0.000s
> > > sys     0m0.008s
> > > 
> > > time awk '$1 == "cpu" { print $7; exit }' /proc/stat
> > > 98
> 
> And you actually believe that this was the equivalent of the above
> top | etc pipe ?
> 
> See below.
> 
> > > real    0m0.004s
> > > user    0m0.000s
> > > sys     0m0.000s
> > > 
> > > Regards,
> > > Ulrich
> > 
> > Hi,
> > 
> > good idea. The only problem is that the information in /proc/stat is in
> > ticks and does not give you an absolute value. So you would have to
> > calculate the difference yourself, which makes the task much more difficult.
> 
> 1) /proc/stat is Linux-specific.
> 2) /proc/stat is what top samples on Linux ;-)
> 3) it is in "USER_HZ", so it's in centiseconds,
>    which makes it easy enough to calculate a meaningful difference.
>    Besides, as long as it is _any_ consistent unit,
>    the unit does not matter, as it is in both numerator and denominator
>    ;-)
> 
> 4) time top -b -n2 | pipe
>    vs. time cat /proc/stat
>    ...
>    You do realize that to get any meaningful measure of "current"
>    cpu usage, while the only measure readily available is "cpu usage
>    since system boot", you need to watch it for a while?
> 
> $ strace -tt -T -e read,select top -b -n2 2>&1 1>/dev/null |
>       grep -Ee ' read[(][0-9]*, "cpu | select[(]'
>   12:27:55.985830 read(3, "cpu  2146371 14 1345585 66007476"..., 8192) = 1586 <0.000395>
>   12:27:56.065141 select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout) <0.500579>
>   12:27:56.644309 read(5, "cpu  2146377 14 1345595 66007497"..., 1024) = 1024 <0.000164>
>   12:27:56.645109 select(0, NULL, NULL, NULL, {3, 0}) = 0 (Timeout) <3.002688>
>   12:27:59.730617 read(5, "cpu  2146379 14 1345604 66007624"..., 1024) = 1024 <0.000164>
> 
> [lars@soda:~/DRBD/drbd-8.3]$ strace -tt -T -e read,select top -b -n2 -d 0.1 2>&1 1>/dev/null |
>       grep -Ee ' read[(][0-9]*, "cpu | select[(]'
>   12:28:14.834497 read(3, "cpu  2146386 14 1345657 66008224"..., 8192) = 1586 <0.000395>
>   12:28:14.919021 select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout) <0.500582>
>   12:28:15.501349 read(5, "cpu  2146391 14 1345669 66008247"..., 1024) = 1024 <0.000165>
>   12:28:15.502149 select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) <0.100165>
>   12:28:15.681910 read(5, "cpu  2146394 14 1345674 66008252"..., 1024) = 1024 <0.000162>
> 
> Notice something? (mind the timestamp, and the -d option...)
> 
> Default delay of top on my setup seems to be 3 seconds.  If you want the same
> "accuracy", then any replacement would need to sleep the same three seconds
> between two samplings of /proc/stat.
> 
> 
> So what you "measure" with your first variant (the print $4; exit)
> is the first "estimate" of top, after sampling /proc/stat twice with
> a hardcoded delay of 0.5 seconds, not waiting for the better, 3 seconds
> sampling period.  While this may be "good enough", it is certainly not
> equivalent.
> 
> You could drop the exit from awk, and do top -b -n1 to achieve the same.
> 
> And exactly what do you think you measure with
>   awk '$1 == "cpu" { print $7; exit }' /proc/stat ?
> column 7 is cumulative centiseconds spent in (hard)irq since system boot.
> What would the HealthCPU agent do with that?
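
To make the column numbering concrete, here is a small sketch that labels each counter on the aggregate "cpu" line with its awk field number. The field names follow the order documented in proc(5); the function name `label_cpu_fields` and the optional file argument are invented for this illustration, and older kernels print fewer columns than the ten listed here.

```shell
#!/bin/bash
# Label the counters on the aggregate "cpu" line of /proc/stat.
# Assumption: field order as documented in proc(5).
# The optional file argument is hypothetical and only there for testing.
label_cpu_fields()
{
    awk '$1 == "cpu" {
        n = split("user nice system idle iowait irq softirq steal guest guest_nice", name, " ")
        # awk field $1 is the "cpu" label, so counter k sits in field k + 1.
        for (k = 1; k <= n && k + 1 <= NF; k++)
            printf "$%d %-10s %s\n", k + 1, name[k], $(k + 1)
        exit
    }' "${1:-/proc/stat}"
}
```

Called without arguments it reads /proc/stat. The idle counter shows up as $5 and irq as $7, and every value is cumulative since boot, so a single reading says nothing about current usage.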
> 
> What you would need to do is sample /proc/stat twice, with a delay of,
> say, 1 to 3 seconds, calculate the relative cpu time spent in idle,
> and get that calculation right.
> That can certainly be done in bash, even in pure bash,
> though to grab the "^cpu " line from /proc/stat I'll use grep anyway;
> that's faster than trying a "while read a b c ... case a in "cpu "*) ..."
> at least in my experience.
> 
> #!/bin/bash
> # Sample the aggregate "cpu" line of /proc/stat twice and print, per field,
> # the percentage of the elapsed CPU time that was spent there.
> percent_cpu_spent_doing_stuff()
> {
>     local delay=${1:-2}
>     # First sample: drop the "cpu" label, keep the counters.
>     set -- $(grep '^cpu ' < /proc/stat)
>     shift
>     local u=${1:-0} n=${2:-0} s=${3:-0} i=${4:-0} io=${5:-0} irq=${6:-0}
>     local softirq=${7:-0} steal=${8:-0} guest=${9:-0} guest_nice=${10:-0}
>     sleep "$delay"
>     # Second sample.
>     set -- $(grep '^cpu ' < /proc/stat)
>     shift
>     local sum0=$((u+n+s+i+io+irq+softirq+steal+guest+guest_nice))
>     local sum1=$((${1:-0} + ${2:-0} + ${3:-0} + ${4:-0} + ${5:-0} + ${6:-0} + ${7:-0} + ${8:-0} + ${9:-0} + ${10:-0}))
>     local d=$((sum1-sum0))
>     echo "$sum1 - $sum0 = $d"
>     # Per-field deltas between the two samples.
>     u=$((${1:-0} - u)) n=$((${2:-0} - n)) s=$((${3:-0} - s)) i=$((${4:-0} - i))
>     io=$((${5:-0} - io)) irq=$((${6:-0} - irq)) softirq=$((${7:-0} - softirq))
>     steal=$((${8:-0} - steal)) guest=$((${9:-0} - guest))
>     guest_nice=$((${10:-0} - guest_nice))
>     local x
>     for x in u n s i io irq softirq steal guest guest_nice
>     do
>         printf "%12s = %3d%%\n" "$x" $((${!x}*100/d))
>     done
> }
> 
> percent_cpu_spent_doing_stuff
> 
> But why would we want to do that?
> 
> Would it be sufficient to add a "delay" parameter to the HealthCPU agent,
> and pass it to "top -b -n2 -d $delay"?  So anyone who wants a completely
> useless erratically fluctuating cpu usage measure can use delay=0.1,
> and others can pass in delay=10 ?
> 
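
A minimal sketch of what such a "delay" parameter could look like, kept separate from the real agent code. The names `parse_idle` and `cpu_idle_percent` are hypothetical. The parser assumes that top prints comma-separated Cpu summary fields with the idle value tagged "id" (for example " 99.0%id" or " 98.5 id"); locales that use a decimal comma would break it.

```shell
#!/bin/bash
# Extract the idle percentage from the LAST Cpu summary line on stdin,
# i.e. from the second sample when fed by "top -b -n2".
# Assumption: comma-separated fields, idle field tagged "id".
parse_idle()
{
    awk '/^%?Cpu/ {
        n = split($0, f, ",")
        for (k = 1; k <= n; k++)
            if (f[k] ~ /id/) { v = f[k]; gsub(/[^0-9.]/, "", v); idle = v }
    }
    END { if (idle != "") print idle; else exit 1 }'
}

# Hypothetical wrapper: "delay" is the proposed parameter (seconds between
# the two samples); 3 mirrors the default delay observed in the strace above.
cpu_idle_percent()
{
    local delay=${1:-3}
    top -b -n2 -d "$delay" | parse_idle
}
```

With this, `cpu_idle_percent 10` would average over ten seconds and `cpu_idle_percent 0.1` would give the fast but fluctuating reading. Note that top still inserts its hardcoded 0.5 second wait before the first summary, as the strace output shows.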



 
