Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

Blue Swirl Sat, 29 May 2010 10:01:49 -0700

2010/5/28 Gleb Natapov <g...@redhat.com>:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <g...@redhat.com>:
>> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> >> 2010/5/27 Gleb Natapov <g...@redhat.com>:
>> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>> >> >> > Blue Swirl wrote:
>> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kis...@web.de> 
>> >> >> >> wrote:
>> >> >> >>> Anthony Liguori wrote:
>> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kis...@web.de>  
>> >> >> >>>>> wrote:
>> >> >> >>>>>
>> >> >> >>>>>> From: Jan Kiszka<jan.kis...@siemens.com>
>> >> >> >>>>>>
>> >> >> >>>>>> This allows to communicate potential IRQ coalescing during 
>> >> >> >>>>>> delivery from
>> >> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >> >> >>>>>> workarounds need to register handlers that return the 
>> >> >> >>>>>> appropriate
>> >> >> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across 
>> >> >> >>>>>> all IRQ
>> >> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, 
>> >> >> >>>>>> it can
>> >> >> >>>>>> apply its workaround. If multiple sinks exist, the source may 
>> >> >> >>>>>> only
>> >> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >> >>>>>>
>> >> >> >>>>> No real devices are interested whether any of their output lines 
>> >> >> >>>>> are
>> >> >> >>>>> even connected. This would introduce a new signal type, 
>> >> >> >>>>> bidirectional
>> >> >> >>>>> multi-level, which is not correct.
>> >> >> >>>>>
>> >> >> >>>> I don't think it's really an issue of correct, but I wouldn't 
>> >> >> >>>> disagree
>> >> >> >>>> to a suggestion that we ought to introduce a new signal type for 
>> >> >> >>>> this
>> >> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq 
>> >> >> >>>> and has a
>> >> >> >>>> similar interface as qemu_irq.
>> >> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >> >>>
>> >> >> >>>>> I think the real solution to coalescing is put the logic inside 
>> >> >> >>>>> one
>> >> >> >>>>> device, in this case APIC because it has the information about 
>> >> >> >>>>> irq
>> >> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >> >>>>> information and whether they get delivered or not. If not, an 
>> >> >> >>>>> internal
>> >> >> >>>>> timer is installed which injects the lost irqs.
>> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >> >>> delivered at a specific regular rate. For this, you also need a 
>> >> >> >>> generic
>> >> >> >>> interface - nothing really "won".
>> >> >> >>
>> >> >> >> OK, let's simplify: just reinject at next possible chance. No need 
>> >> >> >> to
>> >> >> >> monitor or tell anything.
>> >> >> >
>> >> >> > There are guests that won't like this (I know of one in-house, but
>> >> >> > others may even have more examples), specifically if you end up 
>> >> >> > firing
>> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the 
>> >> >> > RTC
>> >> >> > spreads the reinjection according to the current rate.
>> >> >>
>> >> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> >> > If guest's time frequency is the same as host time frequency you can't
>> >> > reinject with constant delay. That is why current code mixes two
>> >> > approaches: reinject M interrupts in a raw then delay.
>> >>
>> >> This approach can be also used by APIC-only version.
>> >>
>> > I don't know what APIC-only version you are talking about. I haven't
>> > seen the code and I don't understand hand waving, sorry.
>>
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.


I will.

>> >> >> guests could also be assisted with special handling (like win2k
>> >> >> install hack), for example guest instructions could be counted
>> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> after at least N instructions have passed.
>> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> > or less reliably using perf counters, may be).
>> >>
>> >> Aren't there any debug registers or perf counters, which can generate
>> >> an interrupt after some number of instructions have been executed?
>> > Don't think debug registers have something like that and they are
>> > available for guest use anyway. Perf counters differs greatly from CPU
>> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> > keep using them for profiling guests. And I don't see what problem it
>> > will solve anyway that can be solved by simple delay between irq
>> > reinjection.
>>
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit number of instruction executed by guest if
> CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles so give to a guest, but in opposite case.

I think one problem is that the guest has executed too much compared
to what would happen with real HW with a lesser CPU. That explains the
RTC frequency reprogramming case.

>
>> >>
>> >> >>
>> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> > about the fact that an IRQ is really periodic and does not only 
>> >> >> > appear
>> >> >> > as such for a certain interval. This really does not sound like
>> >> >> > simplifying things or even make them cleaner.
>> >> >>
>> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> would probably eventually spread to many other devices. The logical
>> >> >> conclusion of that would be a system where all devices would be
>> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> trigger a bug.
>> >> >>
>> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> > pale in comparison (and I still don't see how you are going to make it
>> >> > actually work).
>> >>
>> >> Implement everything inside APIC: only coalescing and reinjection.
>> > APIC has zero info needed to implement reinjection correctly as was
>> > shown to you several time in this thread and you simply keep ignoring
>> > it.
>>
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else.

The information known by RTC (timer period) is also known by higher levels.

> The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.

I don't think anymore that APIC should be handling this but the
generic stuff, like vl.c or exec.c. Then there would be only
information passing from APIC to higher levels.

>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases. It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't show any evidence that RTC is
> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> know when clock frequency changes, it know when device reset happened.
> APIC knows only that interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked not considered coalesced).

Oh, I thought interrupt masking was the reason for coalescing! What
exactly is the reason then?

> Time source knows only when
> frequency changes and may be when device reset happens if timer is
> stopped by device on reset. So RTC is actually a sweet spot if you want
> to minimize amount of info you need to pass between various layers.
>
>> >> Maybe that version would not bend backwards as much as the current to
>> >> cater for buggy hosts.
>> >>
>> > You mean "buggy guests"?
>>
>> Yes, sorry.
>>
>> > What guests are not buggy in your opinion?
>> > Linux tries hard to be smart and as a result the only way to have stable
>> > clock with it is to go paravirt.
>>
>> I'm not an OS designer, but I think an OS should never crash, even if
>> a burst of IRQs is received. Reprogramming the timer should consider
>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> cause of the problem.
> OS should never crash in the absence of HW bugs? I doubt you can design
> an OS that can run in a face of any HW failure. Anyway here we are
> trying to solve guests time keeping problem not crashes. Do you think
> you can design OS that can keep time accurately no matter how crazy all
> HW clock behaves?

I think my OS design skills are not relevant in this discussion, but
IIRC there are fault tolerant operating systems for extreme conditions
so it can be done.

>
>>
>> >> > The fact is that timer device is not "just like any
>> >> > other device" in virtual world. Any other device is easy: you just
>> >> > implement spec as close as possible and everything works. For time
>> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> > letter and your guest will drift like crazy.
>> >>
>> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> without any voodoo. The interrupts would come after executing the same
>> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> guests in any sufficiently desperate low resource conditions, this may
>> >> be the only option that will always work.
>> >>
>> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> > time, so any emulation done between two instruction is theoretically
>> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> > keeping all time source in perfect sync is also impossible.
>>
>> That is actually another cause of the problem. KVM gives the guest an
>> illusion that the VCPU speed is equal to host speed. When they don't
>> match, especially in critical code, there can be problems. It would be
>> better to tell the guest a lower speed, which also can be guaranteed.
>>
> Not possible. It's that simple. You should take it into account in your
> architecture design stage. In case of KVM real physical CPU executes guest
> instruction and it does this as fast as it can. The only way we can hide
> that from a guest is by intercepting each access to TSC and at that
> point we can use bochs instead.

Well, as Paul pointed out, there's also icount option.

>> Maybe we should also offline the device emulation to another host CPU
>> with threading. A load from a device will always be much slower than
>> on real HW though.
> Time drift problem start to happen on loaded servers, so you do not have
> spare CPU to offload device emulation too.
>
> --
>                        Gleb.
>

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

Reply via email to