Andrea Arcangeli wrote:
> Hello,
>
> [ I already sent it once as [EMAIL PROTECTED] but it didn't go through
> for whatever reason, trying again from private email, hope there
> won't be dups ]
>

Oh, it was sent to the list. Don't trust (in case you did) the SourceForge
site for the mails on this list; gmane is much better...

> My worst longstanding problem with KVM is that as the uptime of my
> host system increased, my openSUSE guest images started to destabilize
> and lock up at boot. The weird thing was that fresh after boot
> everything was always perfectly OK, so I thought it was rmmod/insmod,
> or some other sticky effect on the CPU after restarting the guest a
> few times, that triggered the crash. Furthermore, if I loaded the CPU a
> lot (like with while :; do true; done), the crash would magically
> disappear. Decreasing CPU frequency and timings didn't help. Debugging
> wasn't trivial because it required a certain uptime and it didn't
> always crash.
>
> Once I debugged this more aggressively, I figured out KVM was OK: it
> was the guest that crashed in the TSC clocksource because the TSC wasn't
> monotonic. The guest was looping in an infinite loop with IRQs disabled,
> so I tried passing "notsc" and that fixed the crash just fine.
>
> Initially I thought it was the tsc_offset logic being wrong, but then I
> figured out that vcpu_put/vcpu_load weren't always executed. This
> bug check triggers with current git, so I recommend applying it to
> kvm.git to avoid similar nasty, hard-to-detect bugs in the future (Avi
> says VMX would crash hard in such a condition; SVM is much simpler and
> somewhat survives the lack of sched_in, only crashing the guest because
> the TSC is not monotonic):
>
> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ac876ec..26372fa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -742,6 +742,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	WARN_ON(vcpu->cpu != smp_processor_id());
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
>  }
>
>
> So, trying to understand why ->cpu was wrong, I looked into the
> preempt notifiers emulation, and it looked quite fragile without a
> real sched_in hook. I figured out I could provide a real sched_in hook
> by loading the proper values into tsk->thread.debugreg[0/7]. Initially
> I got the hooking points out of objdump -d vmlinux, but Avi preferred
> no dependency on the vmlinux and suggested trying to find the sched_in
> hook on the stack. That's what I have implemented now, and it should
> give real robustness to the out-of-tree module compiled against binary
> kernel images with CONFIG_KVM=n. I tried to be compatible with all
> kernels down to 2.6.5, but it has only been tested on 2.6.2x hosts, only
> on 64-bit, and only on SVM (no VMX system around here at all).
>
> This fixes my longstanding KVM instability, and "-smp 2" now works
> flawlessly with SVM too! -smp 2 -snapshot crashes in qemu userland, but
> that's not kernel related; it must be some thread mutex lock recursion or
> lock inversion in the qcow COW code. Removing -snapshot makes -smp 2
> stable.
> Multiple UP and SMP guests seem stable too.
>

You mean that without -snapshot, userspace does not hang at the sigwait()
in the qcow code?

> To reproduce my crash easily, without waiting ages for the two TSCs to
> drift apart by more than the number of cycles a CPU migration takes, run
> write_tsc(0,0) in kernel mode (for example in the svm.c init function),
> then insmod kvm-amd; rmmod kvm-amd, and finally remove the write_tsc call
> and recompile kvm-amd.
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel