On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kis...@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <p...@codesourcery.com> wrote:
>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>> The guest doesn't even need to be aware of virtualization. It just needs
>>>>> to be able to accommodate the lack of guaranteed realtime behavior.
>>>>>
>>>>> The fundamental problem here is that some guest operating systems assume
>>>>> that the hardware provides certain realtime guarantees with respect to
>>>>> execution of interrupt handlers. In particular they assume that the CPU
>>>>> will always be able to complete execution of the timer IRQ handler before
>>>>> the periodic timer triggers again. In most virtualized environments you
>>>>> have absolutely no guarantee of realtime response.
>>>>>
>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>> tickless kernels. These separate the timekeeping from wakeup events, so
>>>>> it doesn't matter if several wakeup triggers end up getting merged (either
>>>>> at the hardware level or via top/bottom half guest IRQ handlers).
>>>>>
>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>> otherwise stall the CPU for long periods of time.
>>>>>
>>>>> The PIT hack attempts to work around broken guests by adding artificial
>>>>> latency to the timer event, ensuring that the guest "sees" them all.
>>>>> Unfortunately guests vary on when it is safe for them to see the next
>>>>> timer event, and trying to observe this behavior involves potentially
>>>>> harmful heuristics and collusion between unrelated devices (e.g.
>>>>> interrupt controller and timer).
>>>>>
>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>> arbitrarily small amount of time later. This assumes the guest to do
>>>>> useful work in that time. In a single threaded environment this is
>>>>> probably true - qemu got enough CPU to inject the first interrupt, so
>>>>> will probably manage to execute some guest code before the end of its
>>>>> timeslice. In an environment where interrupt processing/delivery and
>>>>> execution of the guest code happen in different threads this becomes
>>>>> increasingly likely to fail.
>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>> What's the amount of hacks that we want then? Is there any generic
>>> The aim of this patch is to reduce the amount of existing and upcoming
>>> hacks. It may still require some refinements, but I think we haven't
>>> found any smarter approach yet that fits existing use cases.
>>
>> I don't feel we have tried other possibilities hard enough.
>
> Well, seeing prototypes wouldn't be bad, also to run real load against
> them. But at least I'm currently clueless what to implement.

Perhaps now is not the time to rush to implement something, but to
brainstorm for a clean solution.

>>
>>>> solution, like slowing down the guest system to the point where we can
>>>> guarantee the interrupt rate vs. CPU execution speed?
>>> That's generally a non-option in virtualized production environments.
>>> Specifically if the guest system lost interrupts due to host
>>> overcommitment, you do not want it to slow down even further.
>>
>> I meant that the guest time could be scaled down, for example 2s in
>> wall clock time would be presented to the guest as 1s.
>
> But that is precisely what already happens when the guest loses timer
> interrupts. There is no other time source for this kind of guests -
> often except for some external events generated by systems which you
> don't want to fall behind arbitrarily.
>
>> Then the amount
>> of CPU cycles between timer interrupts would increase and hopefully
>> the guest can keep up. If the guest sleeps, time base could be
>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>
> Can't follow you ATM, sorry. What should be slowed down then? And how
> precisely?

I think vm_clock and everything that depends on vm_clock should be slowed
down; rtc_clock should also be tied to vm_clock in this mode, not host_clock.

>
> Jan
>
>>
>> Slowing down could be triggered by measuring the guest load (for
>> example, by checking for presence of halt instructions), if it's close
>> to 1, time would be slowed down. If the guest starts to issue halt
>> instructions because it's more idle, we can increase speed.
>>
>> If this approach worked, even APIC could be made ignorant about
>> coalescing voodoo so it should be a major cleanup.
>
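
To make the scaling idea a bit more concrete, here is a rough Python sketch
of the rate selection only. It is not QEMU code and not a proposed patch:
the class ScaledGuestClock, its fields, the 0.9 load threshold and the 0.5
and 2.0 rates are all made up for illustration, and "load" stands for some
estimate of the fraction of time the guest did not spend halted.

```python
from dataclasses import dataclass


@dataclass
class ScaledGuestClock:
    """Hypothetical guest time base whose rate follows the guest load."""

    slow_rate: float = 0.5       # guest seconds per host second when overloaded
    catchup_rate: float = 2.0    # rate while the guest is idle but behind
    busy_threshold: float = 0.9  # load at or above this counts as "close to 1"
    host_time: float = 0.0       # wall clock seconds elapsed
    guest_time: float = 0.0      # seconds presented to the guest

    def rate_for(self, load: float) -> float:
        """Pick the guest/host time ratio for the next interval."""
        if load >= self.busy_threshold:
            return self.slow_rate      # guest cannot keep up: slow time down
        if self.guest_time < self.host_time:
            return self.catchup_rate   # guest is idle: catch up with wall clock
        return 1.0                     # nothing to catch up: run 1:1

    def advance(self, host_dt: float, load: float) -> float:
        """Advance both clocks by host_dt host seconds; return the rate used."""
        rate = self.rate_for(load)
        self.host_time += host_dt
        self.guest_time += rate * host_dt
        # The guest clock must never run ahead of the wall clock.
        self.guest_time = min(self.guest_time, self.host_time)
        return rate
```

With these made-up numbers, a fully loaded guest sees 2s of wall clock time
as 1s, and an idle interval afterwards runs at double speed until the guest
clock meets the wall clock again, after which the rate goes back to 1:1.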