I believe I have found the root cause of the issue where an SMP RHEL5.1 PAE 
guest can't boot. The problem was caused by 
kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
"KVM: VMX: Ensure vcpu time stamp counter is monotonous". (It didn't take me 
much time to find the solution, but it took a lot of time to find the proper 
explanation...  :( )

As we guessed, the problem was the monotonicity of the TSC. I traced into 
the 2.6.18 PAE guest kernel and finally found that it is caused by an 
overflow in the loop of the function update_wall_time() (kernel/timer.c) 
when TSC is used as the clocksource by default.

The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is 
monotonous" introduces a big gap between different VCPUs (a difference 
between their TSC_OFFSETs). Though I have proved that the patch does ensure 
monotonicity on each VCPU (which ruled out my first thought...), the patch 
has 2 problems:

1. It accumulates error. Each VCPU's TSC is monotonic, but it gets slower 
and slower compared to the host. That's because the TSC is very precise and 
the interval between the TSC reads is big. But this is not very critical.

2. The critical one. Under normal conditions, VCPU0 is migrated much more 
frequently than the other VCPUs, and the patch adds another "delta" (always 
negative if the host TSC is stable) to TSC_OFFSET each time the VCPU is 
migrated. So after the guest has been booting for a while, VCPU0 becomes much 
slower than the others (in my test, VCPU0 was migrated about twice as often 
as the others, and easily fell more than 100k cycles behind). In the guest 
kernel, the TSC clocksource is a global variable, so its "cycle_last" field 
may be set from VCPU1's TSC value and then be read on VCPU0. Because VCPU0's 
TSC_OFFSET is smaller than VCPU1's, it is possible for "cycle_last" (from 
VCPU1) to be bigger than the current TSC value (from VCPU0) at the next tick. 
Then "u64 offset = clocksource_read() - cycle_last" wraps around to a huge 
unsigned value and causes the "infinite" loop. This also explains why 
Marcelo's patch doesn't work: it just reduces the rate at which the gap 
grows.
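
To make the wraparound concrete, here is a minimal sketch in plain C. It is 
not the guest kernel code: the helper names, the TSC_OFFSET values, the tick 
spacing and the interval are all assumptions chosen for illustration. It 
only models "guest TSC = host TSC + TSC_OFFSET" and the unsigned subtraction 
quoted above.

```c
#include <stdint.h>

/* Hypothetical model: the guest sees the host TSC plus a per-VCPU TSC_OFFSET. */
static uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/* Same shape as the guest's "u64 offset = clocksource_read() - cycle_last". */
static uint64_t cycles_since(uint64_t now, uint64_t cycle_last)
{
    return now - cycle_last;    /* unsigned: wraps around if now < cycle_last */
}

/*
 * Number of passes a "while (offset >= interval) offset -= interval;" style
 * catch-up loop would need. The loop shape is an assumption; interval must
 * be non-zero.
 */
static uint64_t catchup_iterations(uint64_t offset, uint64_t interval)
{
    return offset / interval;
}

/*
 * Assumed scenario: VCPU0 is 100k cycles behind VCPU1. One tick is handled
 * on VCPU1 (which stores cycle_last), the next one 1000 host cycles later
 * on VCPU0.
 */
static uint64_t skewed_tick_offset(void)
{
    const int64_t vcpu0_offset = -100000;   /* assumed value */
    const int64_t vcpu1_offset = 0;         /* assumed value */
    uint64_t host = 1000000;

    uint64_t cycle_last = guest_tsc(host, vcpu1_offset);    /* tick on VCPU1 */
    host += 1000;                                           /* next tick */
    return cycles_since(guest_tsc(host, vcpu0_offset), cycle_last);
}
```

With these assumed numbers the true difference is -99000 cycles, but as a 
u64 it becomes a value close to 2^64, so a loop that subtracts one interval 
per pass would effectively never finish.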

The freeze didn't happen when using the userspace IOAPIC, simply because the 
qemu APIC doesn't implement real LOWPRI (or round-robin) selection of the CPU 
for delivery. It chooses VCPU0 every time if possible, so CPU1 in the guest 
never updates cycle_last. :( 

This freeze only occurred on RHEL5/5.1 PAE (kernel 2.6.18), because it sets 
IO-APIC IRQ0's dest_mask to 0x3 (with 2 VCPUs) and dest_mode to 
LOWEST_PRIORITY, so the other VCPUs have a chance to modify "cycle_last". In 
contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting CPU0, so it 
doesn't have this problem. Neither does RHEL4 (kernel 2.6.9). 

I don't know if the patch is still needed now, since it was posted long ago 
(I don't know which issue it solved). I'd like to post a revert patch if 
necessary.

-- 
Thanks
Yang, Sheng
