On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote:
> On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> The original apic_timer_intr_post() got:
> 
> >         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> >                 apic->timer.last_update = ktime_add_ns(
> >                                 apic->timer.last_update,
> >                                 apic->timer.period);
> 
> I think the purpose is let guest see a reasonable TMCCT. Which means:
> 
> 1. Timer start to fire at start_apic_timer(). last_update=now(). 
> 
> 2. Every one LAPIC timer interrupt was injected into the guest, so the time 
> at 
> least elapsed timer.period(the first alarm set at "period" later), then 
> last_update is increased by period. Notice last_update's base is set before 
> timer fire, so it's not indicated the next one, but *the time this interrupt 
> should be injected*(time can be delayed)... So last_update = n*period + base.

Right.

> 3. If there is pending interrupt, last_update won't be updated, so we have 
> this:
> 
> >                         while (counter_passed > tmcct)
> >                                 counter_passed -= tmcct;
> >                         tmcct -= counter_passed;
> (notice tmcct is TMICT here.)
> 
> And the returned tmcct value indicated (counter_passed % tmict), a offset 
> regardless of pending interrupt numbers.

Right. The problem is that for accumulated interrupts, the guest will
receive the interrupt as fast as the qemu process can be scheduled (as
long as its not masked, of course). There could be higher priority
vectors in the mix, but thats not the common case.

So calculating the offset using last interrupt injection is not very 
reasonable in this case.

> I think now the overflow seems OK, but I am not sure why last_update can be 
> bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or 
> boundary condition existed...

Yep, it seems to be some small inaccuracy in the accounting that
causes it. 

> And base on this, I don't think my quick fix is correct...
> 
> >
> > 3) And then there's interrupt reinjection. Once interrupts accumulate,
> > and we reinject, current count reads become completly bogus. This is
> > the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> > algorithm calculates the interrupt handling delay by using the PIT
> > count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> > hrtimer expiration time, which has nothing to do with the accumulated
> > interrupts.
> >
> > So what I propose is to first switch lapic current count emulation to a
> > straightforward scaled hrtimer_get_remaining equivalent.
> >
> > For the reinjection case, maintain an average of the delay between host
> > interrupt and interrupt injection (this can be generic, so both PIT
> > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> > emulation whenever pending > 2.
> >
> > What you think?
> 
> 1. If we simply use ktime_get() to update last_update, the interval can't be 
> same between different last_update. I think we may got some problem here, but 
> not for sure. For example, three delayed interrupt was injected one after 
> one, 
> last_update would show very little interval, and APIC_TMCCT may got trouble.  

I mean get rid of last_update and use hrtimer_get_remaining on
apic_get_tmcct for no pending interrupts. And estimated delay between
timer fire and injection if there are pending interrupts.

> Maintain an average of the delay seems a little tricky here, and I am not 
> sure 
> if it would help the problem...  Average is average at most, not the real 
> affection at the time...

Yes. Probably using the last delay between timer fire and injection is
more accurate than an average.

> 
> 2. For the current mechanism, the interval is the same, so last_update always 
> equal to n*period + base. If there are more than one pending interrupts, 
> TMCCT 
> should also can return the relative correct value.
> 
> So I am leaning toward to fix current problem, though I haven't find the root 
> cause yet...

I think it can be simpler, without the need to deal with overflow at
all. 

Understanding the present root cause is important though.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to