On Sun, Jul 6, 2008 at 10:56 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
> Glauber Costa wrote:
>>
>> Marcelo Tosatti wrote:
>>>
>>> Hello,
>>>
>>> I have been discussing with Glauber and Gerd the problem where KVM
>>> guests miscalibrate loops_per_jiffy if there's sufficient load on the
>>> host.
>>>
>>> calibrate_delay_direct() failed to get a good estimate for
>>> loops_per_jiffy.
>>> Probably due to long platform interrupts. Consider using "lpj=" boot
>>> option.
>>> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
>>>
>>> While this particular host calculates lpj=1597041.
>>>
>>> This means that udelay() can delay for less than what asked for, with
>>> fatal results such as:
>>>
>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
>>> 'noapic' kernel parameter
>>>
>>> This bug is easily triggered with a CPU hungry task on nice -20
>>> running only during guest calibration (so that the timer check code on
>>> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
>>>
>>> The problem is that the calibration routines assume a stable relation
>>> between timer interrupt frequency (PIT at this boot stage) and
>>> TSC/execution frequency.
>>>
>>> The emulated timer frequency is based on the host system time and
>>> therefore virtually resistant against heavy load, while the execution
>>> of these routines on the guest is susceptible to scheduling of the QEMU
>>> process.
>>>
>>> To fix this in a transparent way (without direct "lpj=" boot parameter
>>> assignment or a paravirt equivalent), it would be necessary to base the
>>> emulated timer frequency on guest execution time instead of host system
>>> time. But this can introduce timekeeping issues (recent Linux guests
>>> seem to handle lost/late interrupts fine as long as the clocksource is
>>> reliable) and just sounds scary.
>>>
>>> Possible solutions:
>>>
>>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>> fullvirt guests.
>>> - Have the management app guarantee a minimum amount of CPU required
>>> for proper calibration during guest initialization.
>>
>> I don't like any of these solutions, and won't defend any of "the one". So
>> no hard feelings. But I think the "less worse" among them IMHO is the
>> paravirt one. At least it goes in the general direction of "paravirt if
>> you need to scale over xyz".
>
> I agree. A paravirt solution solves the problem.

Please, look at the patch I've attached.

It does __delay with host help. This may have the nice effect of not
busy waiting for long-enough delays, and may well.

It is _completely_ PoC, just to show the idea. It's ugly, broken,
obviously have to go through pv-ops, etc.

Also, I intend to add a lpj field in the kvm clock memory area. We could
do just this later, do both, etc.

If we agree this is a viable solution, I'll start working on a patch

>> I think passing lpj is out of question, and giving the cpu resources for
>> that time is kind of a kludge.
>
> It's all heuristics unfortunately.
>
>> Or maybe we could put the timer expiration alone in a separate thread,
>> with maximum priority (maybe rt priority)? dunno...
>
> But then if you have high-load because of a lot of guests running, you
> defeat yourself. Any attempt to guarantee time to a guest will be defeated
> by lots of guests all attempting calibration at the same time.
>
> Regards,
>
> Anthony Liguori
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

proposal.patch
Description: Binary data
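
To put the numbers from the original report in perspective, here is a rough
sketch of how short a delay loop becomes when it is sized with the
miscalibrated value. It is only an illustration, not part of the attached
patch. Two things are assumptions: HZ=250 (chosen because it reproduces the
"107.00 BogoMIPS" line for lpj=214016), and that the host's lpj=1597041 is
the rate the guest really executes at when it is not preempted. The function
names are made up for this example.

```python
# Assumption: HZ=250, inferred because it reproduces "107.00 BogoMIPS"
# for lpj=214016 as shown in the boot log quoted in the thread.
HZ = 250


def bogomips_str(lpj, hz=HZ):
    """Format BogoMIPS using the same integer arithmetic as the boot message."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"


def actual_delay_us(requested_us, calibrated_lpj, true_lpj):
    """Time really spent in a delay loop sized with calibrated_lpj.

    The loop count is proportional to calibrated_lpj, but the loops run at
    the rate described by true_lpj, so the delay shrinks by their ratio.
    """
    return requested_us * calibrated_lpj / true_lpj


guest_lpj = 214016   # value the loaded guest calibrated
host_lpj = 1597041   # value the host calculates (assumed true rate)

print("guest:", bogomips_str(guest_lpj), "BogoMIPS")
print("host: ", bogomips_str(host_lpj), "BogoMIPS")
print("udelay(1000) really waits about",
      round(actual_delay_us(1000, guest_lpj, host_lpj)), "us")
```

Under these assumptions a requested 1000 us delay lasts only about 134 us,
i.e. roughly 7.5 times too short, which is consistent with the timer check
not waiting long enough for PIT interrupts.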
