I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
short of it is that all of them keep time quite well with 1 vcpu. In the
case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
smp kernels, again with only 1 vcpu (there's no up/smp distinction in
the kernels for RHEL5).

As soon as the number of vcpus is >1, time drifts systematically, with
the guest *leading* the host. I see this on unloaded guests and hosts
(i.e., cpu usage on the host under ~5%). The drift averages around
0.5%-0.6% (i.e., 5-6 seconds gained in the guest per 1000 seconds of
real wall time).

This is very reproducible. All I am doing is installing stock RHEL3.8, 4.4
and 5.2 (i386 versions), starting them, and watching the drift with no
time servers. In all of these recent cases the results are for the
in-kernel PIT.

more in-line below.


Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? It slowly but continually drifts or there are large
>>> drifts at once? Are they using TSC or ACPIPM as clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (i.e., use_tsc is 1).
> 
> Since it's an SMP guest I believe it's using the PIT to generate periodic
> timers and the ACPI pmtimer as a clock source.

Since my last post, I've been reading up on timekeeping and going
through the kernel code -- focusing on RHEL3 at the moment. AFAICT the
PIT is used for timekeeping, and the local APIC timer interrupts are
used as well (supposedly just for per-cpu system accounting, though I
have not gone through all of the code yet). I do not see any reference
to pmtimer in the dmesg output; I thought RHEL3 was not ACPI aware.

> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see whacked bogomips without boosting the guest priority?
>> The whacked bogomips only shows up when started with real-time priority.
>> With 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is lost
> tick and irq latency adjustments, as mentioned in the VMware paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). They try to detect
> lost ticks and compensate by advancing the clock. But the delay between
> the host timer firing, injection of the guest irq, and the actual counter
> read (either tsc or pmtimer) fools these adjustments. clock=pit has no
> such lost tick detection, so it is susceptible to lost ticks under load
> (in theory).

I have read that document quite a few times; clock=pit is required on
ESX for RHEL4 guests to be sane.

> 
> The fact that qemu emulation is less susceptible to the guest clock
> running faster than it should is because the emulated PIT timer is
> rearmed relative to alarm processing (next_expiration = current_time +
> count). But that also means it is susceptible to host load, i.e. the
> frequency is virtual.
> 
> The in-kernel PIT rearms relative to the host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).
> 
> So for RHEL4, clock=pit along with the following patch seems stable for
> me, no drift either direction, even under guest/host load. Can you give
> it a try with RHEL3 ? I'll be doing that shortly.

I'll give it a shot and let you know.

david

> 
> 
> ----------
> 
> Set the count load time to when the count is actually "loaded", not when
> IRQ is injected.
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index c0f7872..b39b141 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>  
>       pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>       pt->scheduled = ktime_to_ns(pt->timer.expires);
> +     ps->channels[0].count_load_time = pt->timer.expires;
>  
>       return (pt->period == 0 ? 0 : 1);
>  }
> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>                 arch->vioapic->redirtbl[0].fields.mask != 1))) {
>                       ps->inject_pending = 1;
>                       atomic_dec(&ps->pit_timer.pending);
> -                     ps->channels[0].count_load_time = ktime_get();
>               }
>       }
>  }
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
