2010/5/28 Gleb Natapov <g...@redhat.com>:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <g...@redhat.com>:
>> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> >> 2010/5/27 Gleb Natapov <g...@redhat.com>:
>> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>> >> >> > Blue Swirl wrote:
>> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>> >> >> >>> Anthony Liguori wrote:
>> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>> >> >> >>>>>
>> >> >> >>>>>> From: Jan Kiszka <jan.kis...@siemens.com>
>> >> >> >>>>>>
>> >> >> >>>>>> This allows communicating potential IRQ coalescing during delivery
>> >> >> >>>>>> from the sink back to the source. Targets that support IRQ
>> >> >> >>>>>> coalescing workarounds need to register handlers that return the
>> >> >> >>>>>> appropriate QEMU_IRQ_* code, and they have to propagate the code
>> >> >> >>>>>> across all IRQ redirections. If the IRQ source receives a
>> >> >> >>>>>> QEMU_IRQ_COALESCED, it can apply its workaround. If multiple sinks
>> >> >> >>>>>> exist, the source may only consider an IRQ coalesced if all other
>> >> >> >>>>>> sinks either report QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >> >>>>>>
>> >> >> >>>>> No real devices are interested in whether any of their output lines
>> >> >> >>>>> are even connected. This would introduce a new signal type,
>> >> >> >>>>> bidirectional multi-level, which is not correct.
>> >> >> >>>>>
>> >> >> >>>> I don't think it's really an issue of correctness, but I wouldn't
>> >> >> >>>> disagree with a suggestion that we ought to introduce a new signal
>> >> >> >>>> type for this type of bidirectional feedback. Maybe it's
>> >> >> >>>> qemu_coalesced_irq and has a similar interface as qemu_irq.
>> >> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >> >>>
>> >> >> >>>>> I think the real solution to coalescing is to put the logic inside
>> >> >> >>>>> one device, in this case APIC because it has the information about
>> >> >> >>>>> irq delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >> >>>>> information and whether they get delivered or not. If not, an
>> >> >> >>>>> internal timer is installed which injects the lost irqs.
>> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >> >>> delivered at a specific regular rate. For this, you also need a
>> >> >> >>> generic interface - nothing really "won".
>> >> >> >>
>> >> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> >> monitor or tell anything.
>> >> >> >
>> >> >> > There are guests that won't like this (I know of one in-house, but
>> >> >> > others may even have more examples), specifically if you end up firing
>> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the
>> >> >> > RTC spreads the reinjection according to the current rate.
>> >> >>
>> >> >> Then reinject with a constant delay, or next CPU exit.
>> >> >> Such buggy
>> >> > If guest's time frequency is the same as host time frequency you can't
>> >> > reinject with constant delay. That is why current code mixes two
>> >> > approaches: reinject M interrupts in a row then delay.
>> >>
>> >> This approach can be also used by APIC-only version.
>> >>
>> > I don't know what APIC-only version you are talking about. I haven't
>> > seen the code and I don't understand hand waving, sorry.
>>
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.
I will.

>> >> >> guests could also be assisted with special handling (like win2k
>> >> >> install hack), for example guest instructions could be counted
>> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> after at least N instructions have passed.
>> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> > or less reliably using perf counters, maybe).
>> >>
>> >> Aren't there any debug registers or perf counters, which can generate
>> >> an interrupt after some number of instructions have been executed?
>> > Don't think debug registers have something like that and they are
>> > available for guest use anyway. Perf counters differ greatly from CPU
>> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> > keep using them for profiling guests. And I don't see what problem it
>> > will solve anyway that can be solved by simple delay between irq
>> > reinjection.
>>
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit number of instructions executed by guest if
> CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles to give to a guest, but in opposite case.

I think one problem is that the guest has executed too much compared to what would happen with real HW with a lesser CPU. That explains the RTC frequency reprogramming case.

>
>>
>> >> >>
>> >> >> > And even if the rate did not matter, the APIC would still have to know
>> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> > as such for a certain interval. This really does not sound like
>> >> >> > simplifying things or even make them cleaner.
>> >> >>
>> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> just like any other device.
>> >> >> With the bidirectional irqs, this voodoo would probably eventually
>> >> >> spread to many other devices. The logical conclusion of that would be
>> >> >> a system where all devices would be careful not to disturb the guest
>> >> >> at wrong moment because that would trigger a bug.
>> >> >>
>> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> > pale in comparison (and I still don't see how you are going to make it
>> >> > actually work).
>> >>
>> >> Implement everything inside APIC: only coalescing and reinjection.
>> > APIC has zero info needed to implement reinjection correctly as was
>> > shown to you several times in this thread and you simply keep ignoring
>> > it.
>>
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else.

The information known by RTC (timer period) is also known by higher levels.

> The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.

I no longer think that APIC should be handling this, but rather the generic stuff, like vl.c or exec.c. Then there would be only information passing from APIC to higher levels.

>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases.
>> It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't shown any evidence that RTC is
> the wrong place. RTC knows when interrupt was acknowledged to RTC, it
> knows when clock frequency changes, it knows when device reset happened.
> APIC knows only that interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked are not considered coalesced).

Oh, I thought interrupt masking was the reason for coalescing! What exactly is the reason then?

> Time source knows only when
> frequency changes and maybe when device reset happens if timer is
> stopped by device on reset. So RTC is actually a sweet spot if you want
> to minimize amount of info you need to pass between various layers.
>
>> >> Maybe that version would not bend backwards as much as the current to
>> >> cater for buggy hosts.
>> >>
>> > You mean "buggy guests"?
>>
>> Yes, sorry.
>>
>> > What guests are not buggy in your opinion?
>> > Linux tries hard to be smart and as a result the only way to have stable
>> > clock with it is to go paravirt.
>>
>> I'm not an OS designer, but I think an OS should never crash, even if
>> a burst of IRQs is received. Reprogramming the timer should consider
>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> cause of the problem.
> OS should never crash in the absence of HW bugs? I doubt you can design
> an OS that can run in the face of any HW failure. Anyway here we are
> trying to solve guests time keeping problem not crashes. Do you think
> you can design OS that can keep time accurately no matter how crazy all
> HW clock behaves?

I think my OS design skills are not relevant in this discussion, but IIRC there are fault tolerant operating systems for extreme conditions so it can be done.
>
>>
>> >> > The fact is that timer device is not "just like any other device" in
>> >> > virtual world. Any other device is easy: you just implement spec as
>> >> > close as possible and everything works. For time source device this is
>> >> > not enough. You can implement RTC+HPET to the letter and your guest
>> >> > will drift like crazy.
>> >>
>> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> without any voodoo. The interrupts would come after executing the same
>> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> guests in any sufficiently desperate low resource conditions, this may
>> >> be the only option that will always work.
>> >>
>> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> > time, so any emulation done between two instructions is theoretically
>> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> > keeping all time sources in perfect sync is also impossible.
>>
>> That is actually another cause of the problem. KVM gives the guest an
>> illusion that the VCPU speed is equal to host speed. When they don't
>> match, especially in critical code, there can be problems. It would be
>> better to tell the guest a lower speed, which also can be guaranteed.
>>
> Not possible. It's that simple. You should take it into account in your
> architecture design stage. In case of KVM real physical CPU executes guest
> instruction and it does this as fast as it can. The only way we can hide
> that from a guest is by intercepting each access to TSC and at that
> point we can use bochs instead.

Well, as Paul pointed out, there's also icount option.

>> Maybe we should also offload the device emulation to another host CPU
>> with threading. A load from a device will always be much slower than
>> on real HW though.
> Time drift problems start to happen on loaded servers, so you do not have
> spare CPU to offload device emulation to.
>
> --
> Gleb.
>
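
To make the two mechanisms argued about above a bit more concrete, here is a minimal sketch in Python rather than QEMU's C, and not taken from the patch. It models the rule from the quoted commit message (a source may treat an IRQ as coalesced only if every sink reports coalesced or masked) and the "reinject M interrupts in a row, then delay" backlog handling. The names `IrqResult`, `combine`, `PeriodicSource` and `burst` are hypothetical, and the `DELIVERED` code is an assumption: the thread only names QEMU_IRQ_COALESCED and QEMU_IRQ_MASKED.

```python
from enum import Enum


class IrqResult(Enum):
    DELIVERED = "delivered"   # assumed name; not mentioned in the thread
    COALESCED = "coalesced"   # stands in for QEMU_IRQ_COALESCED
    MASKED = "masked"         # stands in for QEMU_IRQ_MASKED


def combine(results):
    """Fold per-sink results into the single code the source sees.

    One delivered sink means the IRQ was not lost. Otherwise it counts as
    coalesced if at least one sink coalesced it and the rest were
    coalesced or masked. All-masked stays masked.
    """
    results = list(results)
    if any(r is IrqResult.DELIVERED for r in results):
        return IrqResult.DELIVERED
    if any(r is IrqResult.COALESCED for r in results):
        return IrqResult.COALESCED
    return IrqResult.MASKED


class PeriodicSource:
    """Hypothetical RTC-like source that tracks a backlog of lost ticks."""

    def __init__(self, sinks, burst=2):
        self.sinks = sinks      # callables, each returning an IrqResult
        self.burst = burst      # the "M interrupts in a row" from the thread
        self.backlog = 0

    def _raise(self):
        return combine(sink() for sink in self.sinks)

    def tick(self):
        # A regular timer tick: remember it if the sinks coalesced it.
        if self._raise() is IrqResult.COALESCED:
            self.backlog += 1

    def reinject(self):
        """Retry at most `burst` lost ticks; the caller delays between calls."""
        sent = 0
        while self.backlog and sent < self.burst:
            if self._raise() is not IrqResult.DELIVERED:
                break           # still not deliverable, try again later
            self.backlog -= 1
            sent += 1
        return sent
```

In this sketch a masked result during `tick()` does not grow the backlog, which matches the remark above that interrupts delivered while masked are not considered coalesced. How long to wait between `reinject()` calls is exactly the open question in this thread, so it is left to the caller.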