Glauber Costa wrote:
Marcelo Tosatti wrote:
Hello,
I have been discussing with Glauber and Gerd the problem where KVM
guests miscalibrate loops_per_jiffy if there's sufficient load on the
host.
calibrate_delay_direct() failed to get a good estimate for
loops_per_jiffy.
Probably due to long platform interrupts. Consider using "lpj=" boot
option.
Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
While this particular host calculates lpj=1597041.
This means that udelay() can delay for less than what was asked for, with
fatal results such as:
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
'noapic' kernel parameter
This bug is easily triggered with a CPU-hungry task at nice -20
running only during guest calibration (so that the timer check code in
io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
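To put a rough number on the shortfall, here is a small sketch using the
two lpj values quoted above. It assumes the host's lpj=1597041 is also the
correct value for the guest (same HZ, same loop speed) and that udelay()
busy-waits for a loop count proportional to the calibrated lpj. The
function name is made up for illustration.

```python
# Illustrative arithmetic only; effective_delay_us is a hypothetical helper.
GUEST_LPJ = 214016    # miscalibrated value from the guest boot log
HOST_LPJ = 1597041    # value the host calculates (assumed correct for the guest)


def effective_delay_us(requested_us, calibrated_lpj, true_lpj):
    """Time actually spent in a delay loop sized with a wrong lpj.

    The loop count is proportional to calibrated_lpj, while the speed at
    which the loops really execute corresponds to true_lpj.
    """
    return requested_us * calibrated_lpj / true_lpj


ratio = GUEST_LPJ / HOST_LPJ
print(f"fraction of requested delay actually waited: {ratio:.3f}")
print(f"udelay(1000) waits about {effective_delay_us(1000, GUEST_LPJ, HOST_LPJ):.0f} us")
```

Under these assumptions the guest waits only about 13% of every requested
delay, which would be enough for the timer check to give up before a PIT
interrupt arrives.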
The problem is that the calibration routines assume a stable relation
between timer interrupt frequency (PIT at this boot stage) and
TSC/execution frequency.
The emulated timer frequency is based on the host system time and
therefore virtually resistant against heavy load, while the execution
of these routines on the guest is susceptible to scheduling of the QEMU
process.
To fix this in a transparent way (without direct "lpj=" boot parameter
assignment or a paravirt equivalent), it would be necessary to base the
emulated timer frequency on guest execution time instead of host system
time. But this can introduce timekeeping issues (recent Linux guests
seem to handle lost/late interrupts fine as long as the clocksource is
reliable) and just sounds scary.
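The trade-off between the two timer bases can be sketched with a toy
model. This is not how KVM or QEMU implement anything; it only assumes
that the guest runs for a fixed fraction (cpu_share) of host wall-clock
time during calibration. All names below are hypothetical.

```python
# Toy model of calibration under host load; names are illustrative only.

def measured_lpj(true_lpj, cpu_share, timer_base="host"):
    """lpj the guest would measure between two timer ticks.

    true_lpj:   loops per jiffy when the guest has the CPU to itself
    cpu_share:  fraction of host wall-clock time the guest actually runs
    timer_base: "host"  -> ticks follow host system time
                "guest" -> ticks follow guest execution time
    """
    if not 0.0 < cpu_share <= 1.0:
        raise ValueError("cpu_share must be in (0, 1]")
    if timer_base == "host":
        # A full jiffy of wall-clock time passes, but the guest only
        # executed for cpu_share of it, so it counts fewer loops.
        return round(true_lpj * cpu_share)
    if timer_base == "guest":
        # The tick fires only after a full jiffy of guest execution,
        # so the loop count is unaffected by host load.
        return round(true_lpj)
    raise ValueError("timer_base must be 'host' or 'guest'")


def wall_clock_per_guest_jiffy(cpu_share, timer_base="host"):
    """Host jiffies that elapse for each jiffy the guest observes."""
    if not 0.0 < cpu_share <= 1.0:
        raise ValueError("cpu_share must be in (0, 1]")
    return 1.0 if timer_base == "host" else 1.0 / cpu_share


print(measured_lpj(1597041, 0.134, "host"))
print(measured_lpj(1597041, 0.134, "guest"))
print(wall_clock_per_guest_jiffy(0.25, "guest"))
```

In this model the guest-time base gives the right lpj, but each guest
jiffy then spans several host jiffies, which is the timekeeping problem
mentioned above.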
Possible solutions:
- Require the admin to preset "lpj=". Nasty, not user friendly.
- Pass the proper lpj value via a paravirt interface. Won't cover
fullvirt guests.
- Have the management app guarantee a minimum amount of CPU required
for proper calibration during guest initialization.
I don't like any of these solutions, and won't defend any of them as
"the one". So no hard feelings. But I think the "least bad" among them
is, IMHO, the paravirt one. At least it goes in the general direction
of "paravirt if you need to scale over xyz".
I agree. A paravirt solution solves the problem.
I think passing lpj is out of the question, and giving the cpu
resources for that time is kind of a kludge.
It's all heuristics unfortunately.
Or maybe we could put the timer expiration alone in a separate thread,
with maximum priority (maybe rt priority)? dunno...
But then if you have high load because of a lot of guests running, you
defeat yourself. Any attempt to guarantee time to a guest will be
defeated by lots of guests all attempting calibration at the same time.
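As a back-of-the-envelope illustration of that point, assume every guest
has a single vCPU, all of them calibrate at once, and the host scheduler
splits CPU time evenly. The helper name is hypothetical.

```python
# Rough fair-share estimate; cpu_share_per_guest is a made-up helper.

def cpu_share_per_guest(host_cpus, calibrating_guests):
    """Best-case fraction of a CPU each calibrating guest can get."""
    if host_cpus < 1 or calibrating_guests < 1:
        raise ValueError("need at least one CPU and one guest")
    # A single-vCPU guest can never use more than one full CPU.
    return min(1.0, host_cpus / calibrating_guests)


print(cpu_share_per_guest(4, 2))    # fewer guests than CPUs
print(cpu_share_per_guest(4, 32))   # heavily oversubscribed host
```

Once guests outnumber CPUs, no priority scheme can give each of them a
full CPU at the same moment, so the measured lpj shrinks accordingly.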
Regards,
Anthony Liguori