On Fri, Nov 13, 2009 at 6:05 PM, Michael MacIsaac <[email protected]> wrote:
>
>> As Rob and Alan have less blatantly stated, Linux CPU numbers are bogus
>> in a virtual environment.
> However, with the addition of "steal percentage" (%st in top), the amount of
> CPU that is being "stolen" by the hypervisor, I believe many would agree
> that they are less bogus.

I like your "less bogus" qualification. I'm trying to be more PC and
call it "different". And if we're into word games, I don't like
"steal" in this context. It suggests something you had was taken away.
But in this case, you did not have it and it could not be taken from
you :-)

Most people *do* agree that you need both Linux and z/VM data to make
sense of it or understand whether there is a problem. When someone
claims to have wisdom in only a single metric, you normally don't have
to try very hard to show him wrong.

It is very easy to explain why the old Linux numbers were wrong (and
by how much) when you had the z/VM data already. We use the VM monitor
data to correct the Linux data.
It is true that with "virtual CPU time accounting" in Linux (that is
what produces the steal time) the numbers are no longer affected by
that virtualization effect. They are still a bit off, but in a normal
situation the difference can be ignored. Unfortunately I often deal
with abnormal situations where people have performance problems.

In my "Understanding CPU Usage" presentation I show a case where z/VM
claims the guest uses 30% of a CPU, Linux says it uses 6% of a CPU,
and when you look for detailed per-process usage it adds up to 3% of a
CPU (with the new improved numbers). I bring a stuffed penguin for
someone in the audience who thinks Linux numbers are correct. And each
time it goes back home with me ;-)  This was indeed caused by a kernel
bug. I think we identified 3 problems with the new CPU accounting in
Linux because we match both numbers and want them to be correct.
Eventually those bugs get fixed in your systems too.

Whether the numbers are more correct or more often correct is not
really the issue. I believe it is more important what value those
numbers have for those who see them and whether they let you solve
the performance problem. The reason the Linux admin looks at CPU usage
is not that he is worried about wearing out the CPU, but that he
believes the rest is still available for him to use. In a virtualized
environment it does not work like that. There's still a load of tools
out there that don't show steal as part of the metrics, but just have
user, nice, system. In that case it is better to see 99% to show the
system is out of CPU than to see 25% and have no clue.
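
To make the 25% versus 99% point concrete, here is a rough sketch of
how a tool could derive its percentages from two samples of the
aggregate "cpu" line in /proc/stat. The column order (user, nice,
system, idle, iowait, irq, softirq, steal) is the standard one, but
the sample lines and the function names are made up for illustration.
They are not monitor data from a real guest.

```python
# Column order of the aggregate "cpu" line in /proc/stat.
# The trailing guest columns are skipped: they are already part of user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn one 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta[name] / total for name in FIELDS}


# Hypothetical samples, chosen so the arithmetic is easy to follow.
SAMPLE_BEFORE = "cpu 1000 0 500 8000 100 0 0 400 0 0"
SAMPLE_AFTER = "cpu 1020 0 505 8001 100 0 0 474 0 0"

pct = cpu_percentages(parse_cpu_line(SAMPLE_BEFORE), parse_cpu_line(SAMPLE_AFTER))
busy_without_steal = pct["user"] + pct["nice"] + pct["system"]
print("user+nice+system: %.0f%%" % busy_without_steal)
print("steal:            %.0f%%" % pct["steal"])
print("idle:             %.0f%%" % pct["idle"])
```

With these invented numbers a tool that only adds up user, nice and
system reports 25%, while steal accounts for 74% and the guest is idle
only 1% of the time.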

But back to the original problem. If the 3 IFLs run at 10-15% we do
not expect the old metrics in Linux to show 99%. There must be
something else in the system causing this, and the monitor data would
reveal the cause.

Rob
-- 
Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/
