On Wednesday 14 January 2009 06:01:17 Marcelo Tosatti wrote:
> On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote:
> > On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> >
> > The original apic_timer_intr_post() got:
> > >         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > >                 apic->timer.last_update = ktime_add_ns(
> > >                                 apic->timer.last_update,
> > >                                 apic->timer.period);
> >
> > I think the purpose is let guest see a reasonable TMCCT. Which means:
> >
> > 1. Timer start to fire at start_apic_timer(). last_update=now().
> >
> > 2. Every one LAPIC timer interrupt was injected into the guest, so the
> > time at least elapsed timer.period(the first alarm set at "period"
> > later), then last_update is increased by period. Notice last_update's
> > base is set before timer fire, so it's not indicated the next one, but
> > *the time this interrupt should be injected*(time can be delayed)... So
> > last_update = n*period + base.
>
> Right.
>
> > 3. If there is pending interrupt, last_update won't be updated, so we
> > have
> >
> > this:
> > >                         while (counter_passed > tmcct)
> > >                                 counter_passed -= tmcct;
> > >                         tmcct -= counter_passed;
> >
> > (notice tmcct is TMICT here.)
> >
> > And the returned tmcct value indicated (counter_passed % tmict), a offset
> > regardless of pending interrupt numbers.
>
> Right. The problem is that for accumulated interrupts, the guest will
> receive the interrupt as fast as the qemu process can be scheduled (as
> long as its not masked, of course). There could be higher priority
> vectors in the mix, but thats not the common case.
>
> So calculating the offset using last interrupt injection is not very
> reasonable in this case.

Um... I think it's not easy to find a complete reasonable result here, I 
haven't look into how OS us TMCCT, maybe it can help us to find a more 
reasonable result here.

> > I think now the overflow seems OK, but I am not sure why last_update can
> > be bigger than ktime_get()? Maybe due to vcpu migration? Suspect some
> > racy or boundary condition existed...
>
> Yep, it seems to be some small inaccuracy in the accounting that
> causes it.
>
> > And base on this, I don't think my quick fix is correct...
> >
> > > 3) And then there's interrupt reinjection. Once interrupts accumulate,
> > > and we reinject, current count reads become completly bogus. This is
> > > the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> > > algorithm calculates the interrupt handling delay by using the PIT
> > > count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> > > hrtimer expiration time, which has nothing to do with the accumulated
> > > interrupts.
> > >
> > > So what I propose is to first switch lapic current count emulation to a
> > > straightforward scaled hrtimer_get_remaining equivalent.
> > >
> > > For the reinjection case, maintain an average of the delay between host
> > > interrupt and interrupt injection (this can be generic, so both PIT
> > > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> > > emulation whenever pending > 2.
> > >
> > > What you think?
> >
> > 1. If we simply use ktime_get() to update last_update, the interval can't
> > be same between different last_update. I think we may got some problem
> > here, but not for sure. For example, three delayed interrupt was injected
> > one after one, last_update would show very little interval, and
> > APIC_TMCCT may got trouble.
>
> I mean get rid of last_update and use hrtimer_get_remaining on
> apic_get_tmcct for no pending interrupts. And estimated delay between
> timer fire and injection if there are pending interrupts.
>
> > Maintain an average of the delay seems a little tricky here, and I am not
> > sure if it would help the problem...  Average is average at most, not the
> > real affection at the time...
>
> Yes. Probably using the last delay between timer fire and injection is
> more accurate than an average.
>
> > 2. For the current mechanism, the interval is the same, so last_update
> > always equal to n*period + base. If there are more than one pending
> > interrupts, TMCCT should also can return the relative correct value.
> >
> > So I am leaning toward to fix current problem, though I haven't find the
> > root cause yet...
>
> I think it can be simpler, without the need to deal with overflow at
> all.
>
> Understanding the present root cause is important though.

Yes, but still don't have many clues now... I would look through the code and 
purpose some patches for Alex to test...

-- 
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to