So, Martin, I learned a long time ago that if the doc says 2+2 is 5, that don't make it right. Here is real data, we do understand it, and we do understand how to account for the "error", which is why we don't push for a "fix". So using native Linux tools, this data would be off by a factor of 7 if trying to account for CPU, and for what Linux should account for, it is off by a factor of 4. Don't bet Rob any beverages (or include me on the bet, please); he only made it this bad to demonstrate he understood the problem, after a real production issue showed up at an installation that cares about accurate data and accounting.

"Linux claims to be idle 86% of the time. From VM data I know that we run 100% TTIME and 50% VTIME."

Linux is using a complete IFL, 50% of it virtual, but only thinks it's using 14% of it... This should be enough of a clue for you....



Martin Schwidefsky wrote:

On Mon, 2008-12-08 at 17:41 +0100, Rob van der Heij wrote:

a production server). Not sure we've bothered to report the details since this problem would not impact our users. So the data still can not be used for serious performance

The last time we talked, your tool used the Linux data as one input value of
your calculations. So if the Linux data is really wrong, any fix would improve
the accuracy of your tool, no?

I don't think the measurements based on CPU timer are more accurate
than those based on TOD.


Sorry Rob but this is nonsense.


For one thing because the CPU timer is less accurate than the TOD clock.


Principles of Operation chapter 4 about the CPU timer:
"The CPU timer is a binary counter with a format which is the same as
that of bits 0-63 of the TOD clock, except that bit 0 is considered a
sign. The CPU timer nominally is decremented by subtracting a one in bit
position 51 every microsecond."

I would call this as accurate as the TOD clock. The stepping rates are
not 100% the same if the TOD-clock-steering facility is installed but
the difference is very very small. By the way z/VM is using the same
mechanism to do its own cputime accounting.


It's accurate enough when you measure a single virtual machine. But when the kernel is reloading the CPU timer again and again for each process or thread using a small amount of CPU, the error adds up very quickly.


This statement is wrong. The CPU timer is reprogrammed when a CPU goes
idle, after it wakes up from idle, when a new earliest CPU timer event
is added and when a CPU timer event expires. Usually there are no CPU
timer events so we only reprogram the CPU timer going in and out of
idle. In particular the kernel does not reprogram the CPU timer for each
process. The overall error is minuscule, the following function programs
the CPU timer:

static inline void set_vtimer(__u64 expires)
{
        __u64 timer;

        asm volatile ("  STPT %0\n"  /* Store current cpu timer value */
                      "  SPT %1"     /* Set new value immediately afterwards */
                      : "=m" (timer) : "m" (expires) );
        S390_lowcore.system_timer += S390_lowcore.last_update_timer - timer;
        S390_lowcore.last_update_timer = expires;

        /* store expire time for this CPU timer */
        __get_cpu_var(virt_cpu_timer).to_expire = expires;
}

The instruction to store the current value and the instruction to set
the new value are next to each other. You cannot do better.

There is one problem we recently identified and that is the cputime
spent by the idle process doing actual system work is accounted as idle
time instead of system time. I have a patch for this problem, it will go
upstream with the next merge window. The maximum difference I was able
to create with my testcases has been 0.35%.


And because the CPU timer measures only in-SIE time, you miss the
resources that CP and SIE spent on behalf of the virtual machine. Even
when you don't measure it, someone still has to pay for it ;-)


This is called CP overhead and there are two cases. If CP wants to
account CPU time to the guest because it has done work on behalf of the
guest, it can simply add the time to the guest CPU timer in the SIE
control block before the guest CPU is restarted. The cputime spent by CP
for things not directly related to a guest should NOT be accounted to
the guest. This part of the CP overhead has to be accounted by z/VM.


When I was diagnosing the customer problem, I did notice one bug in
the kernel that probably could be fixed. But I did not have time yet
to try that and see how big the difference would be. And in general,
the additional code for dealing with the CPU timer makes the
unmeasured part of time longer, so in general reduces the capture
ratio.


How is the unmeasured part of the time longer? There is some overhead
for doing the improved Linux cputime accounting but the additional
instructions are fully accounted as cputime in Linux.

--
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390




Barton Robinson
Sr. Architect, Velocity Software
PO 390640, Mountain View, CA 94039-0640
650-964-8867 | http://velocitysoftware.com
"If you can't measure it, I'm just not interested"
