On Fri, Nov 13, 2009 at 6:05 PM, Michael MacIsaac <[email protected]> wrote:
>
>> As Rob and Alan have less blatantly stated, Linux CPU numbers are bogus
>> in a virtual environment.
> However, with the addition of "steal percentage" (%st in top), the amount of
> CPU that is being "stolen" by the hypervisor, I believe many would agree
> that they are less bogus.

I like your "less bogus" qualification. I'm trying to be more PC and call it "different". And if we're into word games: I don't like "steal" in this context. It suggests that something you had was taken away. But in this case you did not have it, so it could not be taken from you :-)

Most people *do* agree that you need both Linux and z/VM data to make sense of it or to understand whether there is a problem. When someone claims to find wisdom in only a single metric, you normally don't have to try very hard to prove him wrong.

It is very easy to explain why the old Linux numbers were wrong (and by how much) when you already had the z/VM data. We use the VM monitor data to correct the Linux data. It is true that with the "virtual CPU time accounting" in Linux (that is what produces the steal time) the numbers are no longer affected by that virtualization effect. The numbers are still a bit off, but in normal situations the difference can be ignored.

Unfortunately I often deal with abnormal situations where people have performance problems. In my "Understanding CPU Usage" presentation I show a case where z/VM claims the guest uses 30% of a CPU, Linux says it uses 6% of a CPU, and when you look for detailed per-process usage it adds up to 3% of a CPU (with the new improved numbers). I bring a stuffed penguin for someone in the audience who thinks Linux numbers are correct. And each time it goes back home with me ;-)

This was indeed caused by a kernel bug. I think we identified 3 problems with the new CPU accounting in Linux because we match both sets of numbers and want them to be correct. Eventually those bugs get fixed in your systems too.

Whether the numbers are more correct or more often correct is not really the issue. I believe it is more important what value those numbers have for those who see them, and whether they let you solve the performance problem.
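
For those who want to see where the steal figure comes from on the Linux side, here is a minimal sketch in Python. It assumes the aggregate "cpu" line of /proc/stat with the column order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal); the function names are my own and purely illustrative. It only shows the Linux view of things and does not replace matching against the z/VM monitor data.

```python
# Column order of the aggregate "cpu" line in /proc/stat, per proc(5).
# The trailing guest columns are skipped here; guest time is already
# included in the user/nice counters.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the tick counters from the aggregate 'cpu' line as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels report fewer columns; treat missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Share of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(path="/proc/stat"):
    """Read one snapshot of the counters from a live Linux system."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.read())
```

Taking two snapshots with sample() a few seconds apart and passing them to cpu_percentages() gives roughly the breakdown that top shows, including %st. A tool that reports only user, nice and system is working from the same counters but leaves the steal column out of the picture.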

The reason the Linux admin looks at CPU usage is not that he is worried about wearing out the CPU, but that he believes the rest is still available for him to use. In a virtualized environment it does not work like that. There's still a load of tools out there that don't show steal as part of the metrics, but just have user, nice, system. In that case it is better to see 99% to show the system is out of CPU than to see 25% and have no clue.

But back to the original problem. If the 3 IFLs run at 10-15% we do not expect the old metrics in Linux to show 99%. There must be something else in the system causing this, and the monitor data would reveal the cause.

Rob
-- 
Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/
