On Sun, Jul 6, 2008 at 10:56 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
> Glauber Costa wrote:
>>
>> Marcelo Tosatti wrote:
>>>
>>> Hello,
>>>
>>> I have been discussing with Glauber and Gerd the problem where KVM
>>> guests miscalibrate loops_per_jiffy if there's sufficient load on the
>>> host.
>>>
>>> calibrate_delay_direct() failed to get a good estimate for
>>> loops_per_jiffy.
>>> Probably due to long platform interrupts. Consider using "lpj=" boot
>>> option.
>>> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
>>>
>>> While this particular host calculates lpj=1597041.
>>>
>>> This means that udelay() can delay for less than what was asked for, with
>>> fatal results such as:
>>>
>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
>>> 'noapic' kernel parameter
>>>
>>> This bug is easily triggered with a CPU hungry task on nice -20
>>> running only during guest calibration (so that the timer check code on
>>> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
>>>
>>> The problem is that the calibration routines assume a stable relation
>>> between timer interrupt frequency (PIT at this boot stage) and
>>> TSC/execution frequency.
>>>
>>> The emulated timer frequency is based on the host system time and
>>> therefore virtually resistant against heavy load, while the execution
>>> of these routines on the guest is susceptible to scheduling of the QEMU
>>> process.
>>>
>>> To fix this in a transparent way (without direct "lpj=" boot parameter
>>> assignment or a paravirt equivalent), it would be necessary to base the
>>> emulated timer frequency on guest execution time instead of host system
>>> time. But this can introduce timekeeping issues (recent Linux guests
>>> seem to handle lost/late interrupts fine as long as the clocksource is
>>> reliable) and just sounds scary.
>>>
>>> Possible solutions:
>>>
>>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>>  fullvirt guests.
>>> - Have the management app guarantee a minimum amount of CPU required
>>> for proper calibration during guest initialization.
>>
>> I don't like any of these solutions, and won't defend any of them as "the
>> one". So no hard feelings. But I think the least bad among them is the
>> paravirt one. At least it goes in the general direction of "paravirt if
>> you need to scale over xyz".
>
> I agree.  A paravirt solution solves the problem.

Please, look at the patch I've attached.

It does __delay with host help. This may have the nice side effect of not
busy-waiting for long-enough delays.
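
As a rough userspace analogy only (this is not the attached patch, and
the function names and the sleep threshold are hypothetical), the sketch
below contrasts a delay that trusts a precomputed loop count with one
that is checked against a clock the waiter does not calibrate itself.
The second form cannot return early, however wrong the calibration was,
and it can give up the CPU for longer waits instead of spinning.

```python
import time


def delay_by_loop_count(usecs, loops_per_usec):
    """Spin for a precomputed number of iterations.

    The real elapsed time is only right if loops_per_usec was calibrated
    correctly; a value that is too small makes the delay too short.
    Returns the number of iterations executed.
    """
    loops = int(usecs * loops_per_usec)
    for _ in range(loops):
        pass
    return loops


def delay_by_clock(usecs, sleep_threshold_us=1000):
    """Wait until a monotonic clock confirms that usecs have elapsed.

    sleep_threshold_us is a hypothetical cut-off: at or above it the
    caller sleeps instead of spinning. Returns elapsed nanoseconds.
    """
    start = time.monotonic_ns()
    deadline = start + int(usecs * 1000)
    if usecs >= sleep_threshold_us:
        time.sleep(usecs / 1_000_000)  # yield the CPU for long waits
    while time.monotonic_ns() < deadline:
        pass  # short waits, or the tail of a sleep, are spun out
    return time.monotonic_ns() - start
```

The clock-based variant trades a clock read per iteration for a
guarantee that the delay is never shorter than requested.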

It is _completely_ a PoC, just to show the idea. It's ugly, broken,
obviously has to go through pv-ops, etc.

Also, I intend to add an lpj field in the kvm clock memory area. We
could do just that later, do both, etc.

If we agree this is a viable solution, I'll start working on a patch.
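
To put the numbers quoted in this thread in perspective, here is a small
arithmetic sketch in plain Python (not kernel code). It assumes the
guest runs with HZ=250, which is inferred from the log line because
214016 / (500000 / 250) gives the printed 107.00 BogoMIPS. The HZ of the
host that calculated lpj=1597041 is not stated, so it is left as a
parameter. The helper names are made up for illustration.

```python
GUEST_LPJ = 214016    # miscalibrated value from the quoted boot log
HOST_LPJ = 1597041    # value the host calculates, per the thread
GUEST_HZ = 250        # assumption: inferred from the BogoMIPS figure


def bogomips_parts(lpj, hz):
    """Integer and two-digit fractional part, computed with the same
    integer arithmetic the kernel uses for the boot message."""
    return lpj // (500000 // hz), (lpj // (5000 // hz)) % 100


def delay_fraction(lpj_used, hz_used, lpj_ref, hz_ref):
    """Fraction of a requested delay that actually elapses when the
    loop count comes from lpj_used but the CPU really runs at the
    rate implied by lpj_ref (both converted to loops per second)."""
    return (lpj_used * hz_used) / (lpj_ref * hz_ref)


if __name__ == "__main__":
    whole, frac = bogomips_parts(GUEST_LPJ, GUEST_HZ)
    print(f"{whole}.{frac:02d} BogoMIPS")
    # host_hz is unknown; two plausible values are shown
    for host_hz in (250, 1000):
        f = delay_fraction(GUEST_LPJ, GUEST_HZ, HOST_LPJ, host_hz)
        print(f"host HZ={host_hz}: udelay() lasts {f:.1%} of the request")
```

Under either assumption about the host HZ, the guest's delays last only
a small fraction of the requested time, which fits the failed PIT check.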

>> I think passing lpj is out of question, and giving the cpu resources for
>> that time is kind of a kludge.
>
> It's all heuristics unfortunately.
>
>> Or maybe we could put the timer expiration alone in a separate thread,
>> with maximum priority (maybe rt priority)? dunno...
>
> But then if you have high-load because of a lot of guests running, you
> defeat yourself.  Any attempt to guarantee time to a guest will be defeated
> by lots of guests all attempting calibration at the same time.
>
> Regards,
>
> Anthony Liguori
>



-- 
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Attachment: proposal.patch
Description: Binary data
