Avi Kivity wrote:
> [copying Thomas for a question about CONSTANT_TSC, below]
>
> Yang, Sheng wrote:
>> I believe I have found the root cause of the issue where an SMP
>> RHEL5.1 PAE guest can't boot. The problem was caused by
>> kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
>> "KVM: VMX: Ensure vcpu time stamp counter is monotonous". (It didn't
>> take me much time to find the solution, but a lot of time to find the
>> proper explanation... :( )
>
> Thanks for tackling this difficult issue. Many have tried and failed,
> looks like you finally nailed it :)
>
>> As we guessed, the problem was the monotonicity of the TSC. I traced
>> into the 2.6.18 PAE guest kernel and finally found that it is caused
>> by an overflow in the loop of the function update_wall_time()
>> (kernel/timer.c) when the TSC is used as the clocksource, which is
>> the default.
>>
>> The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter
>> is monotonous" introduces a big gap between different VCPUs (an error
>> between their TSC_OFFSETs). Though I have proved that the patch does
>> ensure monotonicity on each VCPU (which ruled out my first thought...),
>> the patch has 2 problems:
>>
>> 1. It accumulates error. Each vcpu's TSC is monotonic, but gets slower
>> and slower compared to the host. That's because the TSC is very
>> accurate and the interval between TSC reads is large. But this is not
>> very critical.
>>
>> 2. The critical one. Under normal conditions, VCPU0 is migrated much
>> more frequently than the other VCPUs, and the patch adds another
>> "delta" (always negative if the host TSC is stable) to TSC_OFFSET on
>> each migration. So after the guest has been booted for a while, VCPU0
>> becomes much slower than the others (in my test, VCPU0 was migrated
>> about twice as often as the others, and easily ended up more than 100k
>> cycles behind). In the guest kernel, the TSC clocksource is a global
>> variable, so its "cycle_last" field may be set from VCPU1's TSC value,
>> and then the next tick is handled on VCPU0. Since VCPU0's TSC_OFFSET
>> is smaller than VCPU1's, "cycle_last" (from VCPU1) can be bigger than
>> the current TSC value (from VCPU0) in the next tick. Then
>> "u64 offset = clocksource_read() - cycle_last" overflows and causes
>> the "infinite" loop. This also explains why Marcelo's patch doesn't
>> work: it only reduces the rate at which the gap grows.
>>
>> The freeze didn't happen with the userspace IOAPIC simply because the
>> qemu APIC doesn't implement real LOWPRI (or round robin) to choose the
>> CPU for delivery. It chooses VCPU0 every time if possible, so CPU1 in
>> the guest never updates cycle_last. :(
>>
>> The freeze only occurred on RHEL5/5.1 PAE (kernel 2.6.18) because they
>> set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode to
>> LOWEST_PRIORITY, so other vcpus had a chance to modify "cycle_last".
>> In contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting
>> CPU0, and so doesn't have this problem. Neither does RHEL4 (kernel
>> 2.6.9).
>>
>> I don't know if the patch is still needed now, since it was posted
>> long ago (I don't know which issue it solved). I'd like to post a
>> revert patch if necessary.
>
> I believe the patch is still necessary, since we still need to
> guarantee that a vcpu's tsc is monotonic. I think there are three
> issues to be addressed:
>
> 1. The majority of intel machines don't need the offset adjustment
> since they already have a constant rate tsc that is synchronized on
> all cpus. I think this is indicated by X86_FEATURE_CONSTANT_TSC
> (though I'm not 100% certain if it means that the rate is the same for
> all cpus, Thomas can you clarify?)
>
> This will improve tsc quality for those machines, but we can't depend
> on it, since some machines don't have constant tsc. Further, I don't
> think really large machines can have constant tsc since clock
> distribution becomes difficult or impossible.
>
> 2. We should implement round robin and lowest priority like qemu does.
> Xen does the same thing:
>
>> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
>> #define IRQ0_SPECIAL_ROUTING 1
>
> in arch/x86/hvm/vioapic.c, at least for irq 0.
>
> 3. The extra migrations on vcpu 0 are likely due to its role servicing
> I/O on behalf of the entire virtual machine. We should move this extra
> work to an independent thread. I have done some work in this area. It
> is becoming more important as kvm becomes more scalable.
>
Will there be a new release in the near future? Many of us are waiting
for this bug to be fixed on quad-core and other multi-core CPUs.

-- 
Levente                               "Si vis pacem para bellum!"