I believe I have found the root cause of the issue where an SMP RHEL5.1 PAE guest can't boot up. The problem was introduced by kvm commit 6685637b211ad67bdce21bfd9f91bc888b3acb4f, "KVM: VMX: Ensure vcpu time stamp counter is monotonous". (It didn't take me much time to find the solution, but it took a lot of time to find the proper explanation... :( )
As we guessed, the problem is TSC monotonicity. I traced into the 2.6.18 PAE guest kernel and finally found that the hang is caused by an overflow in the loop of update_wall_time() (kernel/timer.c) when TSC is used as the clocksource, which is the default. The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is monotonous" introduces a big gap between the TSCs of different VCPUs (an error between their TSC_OFFSETs).

Though I have verified that the patch does keep the TSC monotonic on each individual VCPU (which ruled out my first guess...), the patch has 2 problems:

1. It accumulates error. Each VCPU's TSC is monotonic, but it gets slower and slower compared to the host. That's because the TSC is very fine-grained and the interval between the TSC reads is large. But this is not very critical.

2. The critical one. Under normal conditions, VCPU0 is migrated much more frequently than the other VCPUs, and the patch adds another "delta" (always negative if the host TSC is stable) to TSC_OFFSET on each migration. So after the guest has been booting for a while, VCPU0 becomes much slower than the others (in my test, VCPU0 was migrated about twice as often as the others, and easily fell more than 100k cycles behind).

In the guest kernel, the TSC clocksource is a global variable, so the variable "cycle_last" may be set from VCPU1's TSC value and then be used on VCPU0. Because VCPU0's TSC_OFFSET is smaller than VCPU1's, "cycle_last" (from VCPU1) can be bigger than the current TSC value (from VCPU0) at the next tick. Then "u64 offset = clocksource_read() - cycle_last" overflows (it wraps around to a huge unsigned value) and causes the "infinite" loop.

This also explains why Marcelo's patch doesn't work: it only reduces the rate at which the gap grows.

The freeze didn't happen with the userspace IOAPIC simply because the qemu APIC doesn't implement real LOWPRI (or round-robin) arbitration when choosing the CPU for delivery. It chooses VCPU0 every time if possible, so CPU1 in the guest never updates cycle_last.
This freeze only occurred on RHEL5/5.1 PAE (kernel 2.6.18), because those guests set the IO-APIC IRQ0 dest_mask to 0x3 (with 2 vcpus) and dest_mode to LOWEST_PRIORITY, so the other vcpus had a chance to modify "cycle_last". In contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting CPU0, so it doesn't have this problem. The same goes for RHEL4 (kernel 2.6.9).

I don't know whether the patch is still needed now, since it was posted long ago (I don't know which issue it solved). I'd like to post a revert patch if necessary.

--
Thanks
Yang, Sheng

_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel