Another tidbit for you guys as I make my way through the various permutations: I installed the RHEL3 hugemem kernel, and guest behavior is *much* better. System time still shows regular hiccups that are higher than under xen and esx (e.g., roughly one out of every five 1-minute samples shows system time between 10 and 15%), but overall guest behavior is good with the hugemem kernel.
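For anyone who wants to reproduce that measurement, here is a minimal, untested sketch of one way to take 1-minute system-time samples inside the guest. It assumes the usual /proc/stat "cpu" line layout (user nice system idle ...), with system time as the third field:

    #include <stdio.h>
    #include <unistd.h>

    /* Sample the aggregate "cpu" line from /proc/stat twice, 60 seconds
     * apart, and print system time as a percentage of total time.
     * Assumes the usual field order: user nice system idle ... */
    static int read_cpu(unsigned long long *sys, unsigned long long *total)
    {
        unsigned long long v[7] = {0};
        FILE *f = fopen("/proc/stat", "r");
        int i, n;

        if (!f)
            return -1;
        n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
        fclose(f);
        if (n < 4)
            return -1;
        *sys = v[2];                    /* "system" is the third field */
        for (*total = 0, i = 0; i < n; i++)
            *total += v[i];
        return 0;
    }

    int main(void)
    {
        unsigned long long s1, t1, s2, t2;

        for (;;) {
            if (read_cpu(&s1, &t1))
                return 1;
            sleep(60);
            if (read_cpu(&s2, &t2))
                return 1;
            printf("system: %.1f%%\n",
                   100.0 * (s2 - s1) / (double)(t2 - t1));
            fflush(stdout);
        }
    }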
One side effect I've noticed is that I cannot restart the RHEL3 guest running the hugemem kernel on successive attempts. The guest has 2 vcpus, and qemu shows one thread at 100% cpu. If I recall correctly, kvm_stat shows a huge number of tlb_flushes (millions in a 5-second sample). The scenario is:

    1. start the guest running the hugemem kernel,
    2. shut it down,
    3. restart the guest.

During 3. it hangs, but at random points. Removing and reloading the kvm/kvm-intel modules has no effect - the guest still hangs on the restart. Rebooting the host clears the problem. Alternatively, during the hang on a restart I can kill the guest and then, on the next restart, choose the normal 32-bit smp kernel; the guest boots just fine. At that point I can shut down the guest and restart it with the hugemem kernel, and it boots just fine.
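If anyone wants to watch a single counter (e.g. tlb_flush) the same way without the full kvm_stat output, here is a rough, untested sketch. It assumes debugfs is mounted at /sys/kernel/debug and that kvm exposes its counters as flat files under /sys/kernel/debug/kvm/ -- adjust the path for your setup:

    #include <stdio.h>
    #include <unistd.h>

    /* Read one numeric counter file; return its value, or -1 on error. */
    static long long read_counter(const char *path)
    {
        long long v = -1;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (fscanf(f, "%lld", &v) != 1)
            v = -1;
        fclose(f);
        return v;
    }

    int main(int argc, char **argv)
    {
        char path[256];
        long long prev, cur;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <counter>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/sys/kernel/debug/kvm/%s", argv[1]);

        prev = read_counter(path);
        if (prev < 0) {
            perror(path);
            return 1;
        }
        for (;;) {
            sleep(5);                   /* 5-second sample interval */
            cur = read_counter(path);
            if (cur < 0)
                break;
            printf("%s: +%lld over 5s\n", argv[1], cur - prev);
            fflush(stdout);
            prev = cur;
        }
        return 0;
    }

Run it as, e.g., ./kvmcount tlb_flush.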
david

David S. Ahern wrote:
> Hi Marcelo:
>
> mmu_recycled is always 0 for this guest -- even after almost 4 hours of
> uptime.
>
> Here is a kvm_stat sample where guest time was very high and qemu had 2
> processors at 100% on the host. I removed counters where both columns
> are 0, for brevity.
>
>  exits                45937979  758051
>  fpu_reload            1416831      87
>  halt_exits             112911       0
>  halt_wakeup             31771       0
>  host_state_reload     2068602     263
>  insn_emulation       21601480  365493
>  io_exits              1827374    2705
>  irq_exits             8934818  285196
>  mmio_exits             421674     147
>  mmu_cache_miss        4817689   93680
>  mmu_flooded           4815273   93680
>  mmu_pde_zapped          51344       0
>  mmu_prefetch          4817625   93680
>  mmu_pte_updated      14803298  270104
>  mmu_pte_write        19859863  363785
>  mmu_shadow_zapped     4832106   93679
>  pf_fixed             32184355  468398
>  pf_guest               264138       0
>  remote_tlb_flush     10697762  280522
>  tlb_flush            10301338  176424
>
> (NOTE: this is for a *5*-second sample interval instead of 1, to allow
> me to capture the data.)
>
> Here's a sample when the guest is "well-behaved" (system time <10%):
>
>  exits                51502194   97453
>  fpu_reload            1421736     227
>  halt_exits             138361    1927
>  halt_wakeup             33047     117
>  host_state_reload     2110190    3740
>  insn_emulation       24367441   47260
>  io_exits              1874075    2576
>  irq_exits            10224702   13333
>  mmio_exits             435154    1726
>  mmu_cache_miss        5414097   11258
>  mmu_flooded           5411548   11243
>  mmu_pde_zapped          52851      44
>  mmu_prefetch          5414031   11258
>  mmu_pte_updated      16854686   29901
>  mmu_pte_write        22526765   42285
>  mmu_shadow_zapped     5430025   11313
>  pf_fixed             36144578   67666
>  pf_guest               282794     430
>  remote_tlb_flush     12126268   14619
>  tlb_flush            11753162   21460
>
> There is definitely a strong correlation between the mmu counters and
> high system times in the guest. I am still trying to find out what in
> the guest is stimulating it when running on RHEL3; I do not see this
> same behavior for an equivalent setup running on RHEL4.
>
> By the way, I added an mmu_prefetch stat in prefetch_page() to count the
> number of times the for() loop is hit with PTTYPE == 64, i.e., the
> number of times paging64_prefetch_page() is invoked. (I wanted an
> explicit counter for this loop, though the info seems to duplicate other
> entries.) That counter is listed above. As I mentioned in a prior post,
> when kscand kicks in the mmu_prefetch counter changes at 20,000+/sec,
> with each trip through that function taking 45k+ cycles.
>
> kscand is an instigator shortly after boot; however, kscand is *not* the
> culprit once the system has been up for 30-45 minutes. I have started
> instrumenting the RHEL3U8 kernel, and for the load I am running kscand
> does not walk the active lists very often once the system is up.
>
> So, to dig deeper on what in the guest is stimulating the mmu, I
> collected kvmtrace data over about a 2-minute interval, which caught a
> roughly 30-second period where guest system time was steady in the
> 25-30% range. Summarizing the number of times a RIP appears in a VMEXIT
> shows the following high runners:
>
>  count  RIP         RHEL3 symbol
>  82549  0xc0140e42  follow_page      [kernel] c0140d90  offset b2
>  42532  0xc0144760  handle_mm_fault  [kernel] c01446d0  offset 90
>  36826  0xc013da4a  futex_wait       [kernel] c013d870  offset 1da
>  29987  0xc0145cd0  zap_pte_range    [kernel] c0145c10  offset c0
>  27451  0xc0144018  do_no_page       [kernel] c0143e20  offset 1f8
>
> (The halt entry is removed from the list, since that is the ideal
> scenario for an exit.)
>
> So the RIP correlates to follow_page() for a large percentage of the
> VMEXITs.
>
> I wrote an awk script to summarize (histogram style) the TSC cycles
> between VMEXIT and VMENTRY for an address. For the first rip,
> 0xc0140e42, 82,271 times (i.e., almost 100% of the time) the trace shows
> a delta between 50k and 100k cycles between the VMEXIT and the
> subsequent VMENTRY. Similarly, for the second one, 0xc0144760, 42,403
> times (again almost 100% of the occurrences) the trace shows a delta
> between 50k and 100k cycles between VMEXIT and VMENTRY. These seem to
> correlate with the prefetch_page function in kvm, though I am not 100%
> positive on that.
>
> I am now investigating the kernel paths leading to those functions. Any
> insights would definitely be appreciated.
>
> david
>
>
> Marcelo Tosatti wrote:
>> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>>> Most of the cycles (~80% of that 54k+) are spent in
>>> paging64_prefetch_page():
>>>
>>>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>>>
>>>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>>                                   sizeof(pt_element_t));
>>>         if (r || is_present_pte(pt))
>>>             sp->spt[i] = shadow_trap_nonpresent_pte;
>>>         else
>>>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>>>     }
>>>
>>> This loop runs 512 times and takes a total of ~45k cycles, or ~88
>>> cycles per iteration.
>>>
>>> This function gets run >20,000 times/sec during some of the kscand
>>> loops.
>>
>> Hi David,
>>
>> Do you see the mmu_recycled counter increase?
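P.S. For anyone who wants to redo the VMEXIT->VMENTRY histogram: below is a rough C equivalent of the awk script. It assumes the binary kvmtrace log has already been formatted into text lines of the form "<tsc> <event> [<rip>]" -- that layout is illustrative, not necessarily what your formatter emits -- and it buckets the exit-to-entry cycle deltas for one RIP of interest:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *target = "0xc0140e42";  /* RIP of interest */
        char line[256], event[64], rip[64];
        unsigned long long tsc, exit_tsc = 0;
        int in_window = 0;
        long h[4] = {0};        /* <10k, 10k-50k, 50k-100k, >=100k */

        while (fgets(line, sizeof(line), stdin)) {
            int n = sscanf(line, "%llu %63s %63s", &tsc, event, rip);
            if (n < 2)
                continue;
            if (n == 3 && !strcmp(event, "VMEXIT") &&
                !strcmp(rip, target)) {
                exit_tsc = tsc;         /* remember when we left the guest */
                in_window = 1;
            } else if (in_window && !strcmp(event, "VMENTRY")) {
                unsigned long long d = tsc - exit_tsc;
                h[d < 10000 ? 0 : d < 50000 ? 1 : d < 100000 ? 2 : 3]++;
                in_window = 0;
            }
        }
        printf("<10k:%ld  10k-50k:%ld  50k-100k:%ld  >=100k:%ld\n",
               h[0], h[1], h[2], h[3]);
        return 0;
    }

Adjust target and the bucket boundaries for whatever RIP and granularity you care about.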