On Mon, 2008-12-08 at 17:41 +0100, Rob van der Heij wrote:
> >> a production server). Not sure we've bothered to report the details since
> >> this problem would not impact our users. So the data still can not be used
> >> for serious performance
> >
> > The last time we talked, your tool used the Linux data as one input value of
> > your calculations. So if the Linux data is really wrong, any fix would
> > improve the accuracy of your tool, no?
>
> I don't think the measurements based on CPU timer are more accurate
> than those based on TOD.
Sorry Rob, but this is nonsense.
> For one thing because the CPU timer is less accurate than the TOD clock.
Principles of Operation chapter 4 about the CPU timer:
"The CPU timer is a binary counter with a format which is the same as
that of bits 0-63 of the TOD clock, except that bit 0 is considered a
sign. The CPU timer nominally is decremented by subtracting a one in bit
position 51 every microsecond."
I would call this as accurate as the TOD clock. The stepping rates are
not exactly the same if the TOD-clock-steering facility is installed, but
the difference is very small. By the way, z/VM uses the same mechanism
for its own cputime accounting.
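To put numbers on the quoted format: with bit 51 stepping once per
microsecond (bit 0 being the most significant bit), bits 52-63 hold
fractions of a microsecond, so one microsecond corresponds to 4096 timer
units. A minimal sketch of the conversion follows. The helper names are
made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Bit 51 (bit 0 = most significant bit) steps once per microsecond, so
 * bits 52-63 are sub-microsecond fractions: 2^12 = 4096 units per usec. */
#define TIMER_UNITS_PER_USEC 4096ULL

/* Hypothetical helper: CPU timer / TOD delta -> whole microseconds. */
static uint64_t timer_units_to_usecs(uint64_t units)
{
	return units >> 12;
}

/* Hypothetical helper: microseconds -> CPU timer / TOD units. */
static uint64_t usecs_to_timer_units(uint64_t usecs)
{
	return usecs << 12;
}
```

Both counters share this resolution, which is why a delta taken from
either of them converts to microseconds with the same shift.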
> It's accurate enough when you measure a single virtual machine.
> But when the kernel is reloading the CPU timer again and again for
> each process or thread using a small amount of CPU, the error adds up
> very quick.
This statement is wrong. The CPU timer is reprogrammed when a CPU goes
idle, after it wakes up from idle, when a new earliest CPU timer event
is added and when a CPU timer event expires. Usually there are no CPU
timer events so we only reprogram the CPU timer going in and out of
idle. In particular, the kernel does not reprogram the CPU timer for each
process. The overall error is minuscule. The following function programs
the CPU timer:
static inline void set_vtimer(__u64 expires)
{
	__u64 timer;

	asm volatile ("  STPT %0\n"  /* Store current cpu timer value */
		      "  SPT %1"     /* Set new value immediately afterwards */
		      : "=m" (timer) : "m" (expires) );
	S390_lowcore.system_timer += S390_lowcore.last_update_timer - timer;
	S390_lowcore.last_update_timer = expires;

	/* store expire time for this CPU timer */
	__get_cpu_var(virt_cpu_timer).to_expire = expires;
}
The instruction to store the current value and the instruction to set
the new value are next to each other. You cannot do better.
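The bookkeeping in set_vtimer() can be tried out in user space. The
following is only a sketch under loud assumptions: the CPU timer is
modelled as a plain variable that counts down, and struct fake_lowcore,
fake_cpu_timer, burn_cpu() and fake_set_vtimer() are names invented for
this illustration. The STPT/SPT pair is replaced by two ordinary
assignments. It shows that the time consumed since the last update is
credited in full, no matter how often the timer is reloaded.

```c
#include <stdint.h>

/* Invented stand-in for the two lowcore fields used by set_vtimer(). */
struct fake_lowcore {
	uint64_t system_timer;       /* accumulated system time, timer units */
	uint64_t last_update_timer;  /* CPU timer value at the last update */
};

static uint64_t fake_cpu_timer;      /* models the decrementing CPU timer */

/* Model the CPU consuming time: the CPU timer counts down. */
static void burn_cpu(uint64_t units)
{
	fake_cpu_timer -= units;
}

/* Same arithmetic as set_vtimer(), with STPT/SPT replaced by assignments. */
static void fake_set_vtimer(struct fake_lowcore *lc, uint64_t expires)
{
	uint64_t timer = fake_cpu_timer;  /* "STPT": read the current value */

	fake_cpu_timer = expires;         /* "SPT": load the new value */

	/* The timer counts down, so old value minus current value is the
	 * time that elapsed since the last update. */
	lc->system_timer += lc->last_update_timer - timer;
	lc->last_update_timer = expires;
}
```

Starting from a timer value of 1000000, burning 300 units, reloading,
burning another 200 units and reloading again leaves system_timer at 500.
In this model nothing is lost. On real hardware the only time that can
escape the bookkeeping is what passes between the STPT and the SPT.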
There is one problem we recently identified: the cputime spent by the
idle process doing actual system work is accounted as idle time instead
of system time. I have a patch for this problem, and it will go upstream
with the next merge window. The maximum difference I was able to create
with my testcases has been 0.35%.
> And because the CPU timer measures only in-SIE time, you miss the
> resources that CP and SIE spent on behalf of the virtual machine. Even
> when you don't measure it, someone still has to pay for it ;-)
This is called CP overhead and there are two cases. If CP wants to
account CPU time to the guest because it has done work on behalf of the
guest, it can simply add the time to the guest CPU timer in the SIE
control block before the guest CPU is restarted. The cputime spent by CP
for things not directly related to a guest should NOT be accounted to
the guest. This part of the CP overhead has to be accounted by z/VM.
> When I was diagnosing the customer problem, I did notice one bug in
> the kernel that probably could be fixed. But I did not have time yet
> to try that and see how big the difference would be. And in general,
> the additional code for dealing with the CPU timer makes the
> unmeasured part of time longer, so in general reduces the capture
> ratio.
How is the unmeasured part of the time longer? There is some overhead
for doing the improved Linux cputime accounting but the additional
instructions are fully accounted as cputime in Linux.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.