David S. Ahern wrote:
> I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:
>
> 1. before vcpu->arch.mmu.page_fault()
> 2. after vcpu->arch.mmu.page_fault()
> 3. after mmu_topup_memory_caches()
> 4. after emulate_instruction()
>
> So the deltas in the trace reports show:
> - cycles required for arch.mmu.page_fault        (tracer 2)
> - cycles required for mmu_topup_memory_caches()  (tracer 3)
> - cycles required for emulate_instruction()      (tracer 4)
>
> I captured trace data for ~5 seconds during one of the usual events (again
> this time it was due to kscand in the guest). I ran the formatted trace data
> through an awk script to summarize:
>
>     TSC cycles            tracer2     tracer3     tracer4
>          0 -  10,000:      295067      213251      115873
>     10,001 -  25,000:        7682        1004       98336
>     25,001 -  50,000:         201          15          36
>     50,001 - 100,000:      100655           0          10
>       over   100,000:         117           0          15
>
> This means vcpu->arch.mmu.page_fault() was called 403,722 times in the
> roughly 5-second interval: 295,067 times it took < 10,000 cycles, but
> 100,772 times it took longer than 50,000 cycles. The page_fault function
> getting run is paging64_page_fault.
>
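The awk script itself wasn't posted; the summary amounts to bucketing each
per-tracer cycle delta into the ranges above. A rough standalone equivalent in
C, assuming one delta value per line on stdin, would look something like this:

/*
 * Rough equivalent of the bucketing step (the actual awk script wasn't
 * posted).  Reads one TSC cycle delta per line on stdin and counts how
 * many fall into each of the ranges in the table above.
 */
#include <stdio.h>

int main(void)
{
    unsigned long long delta, bucket[5] = { 0 };

    while (scanf("%llu", &delta) == 1) {
        if (delta <= 10000)
            bucket[0]++;
        else if (delta <= 25000)
            bucket[1]++;
        else if (delta <= 50000)
            bucket[2]++;
        else if (delta <= 100000)
            bucket[3]++;
        else
            bucket[4]++;
    }

    printf("     0 -  10,000: %llu\n", bucket[0]);
    printf("10,001 -  25,000: %llu\n", bucket[1]);
    printf("25,001 -  50,000: %llu\n", bucket[2]);
    printf("50,001 - 100,000: %llu\n", bucket[3]);
    printf("  over   100,000: %llu\n", bucket[4]);
    return 0;
}

Run once per tracer column to get the three histograms.
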
This does look like the fork detector. Once in every four faults, it triggers
and the fault becomes slow. 100K floods == 100K page tables == 200GB of
virtual memory (at 2MB of virtual address space per page table), which seems
excessive. Is this running a forked load like apache, with many processes?
How much memory does the guest have, and is there any memory pressure?

> mmu_topup_memory_caches() and emulate_instruction() were both run 214,270
> times, most of them relatively quickly.
>
> Note: I bumped the scheduling priority of the qemu threads to RR 1 so that
> few host processes could interrupt it.
>
> david
>
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> I added the traces and captured data over another apparent lockup of the
>>> guest. This seems to be representative of the sequence (pid/vcpu removed).
>>>
>>>  (+4776)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>>>  (+3632)  VMENTRY
>>>  (+4552)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
>>>  (+ 54928) VMENTRY
>>>
>> Can you oprofile the host to see where the 54K cycles are spent?
>>
>>>  (+4568)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>>>  (+   0)  PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
>>>  (+8432)  VMENTRY
>>>  (+3936)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>>>  (+   0)  PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>>>  (+ 13832) VMENTRY
>>>
>>>  (+5768)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>>>  (+3712)  VMENTRY
>>>  (+4576)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
>>>  (+   0)  PTE_WRITE   [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>>>
>> This indeed has the accessed bit clear.
>>
>>>  (+ 65216) VMENTRY
>>>  (+4232)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>>>  (+   0)  PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>>>
>> This has the accessed bit set and the user bit clear, and the pte pointing
>> at the previous pte_write gpa. Looks like a kmap_atomic(). (The flag bits
>> are decoded below.)
>>
>>>  (+8640)  VMENTRY
>>>  (+3936)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>>>  (+   0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>>>  (+   0)  PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>>>  (+ 14160) VMENTRY
>>>
>>> I can forward a more complete time snippet if you'd like. vcpu0 +
>>> corresponding vcpu1 files have 85,000 total lines and the compressed files
>>> total ~500k.
>>>
>>> I did not see the FLOODED trace come out during this sample, though I did
>>> bump the count from 3 to 4 as you suggested.
>>>
>> Bumping the count was supposed to remove the flooding...
>>
>>> Correlating rip addresses to the 2.4 kernel:
>>>
>>> c0160d00-c0161290 = page_referenced
>>>
>>> It looks like the event is kscand running through the pages.
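To make the two inline pte comments above concrete: the decoding uses only the
architectural x86 pte flag bits (P is bit 0, R/W bit 1, U/S bit 2, A bit 5,
D bit 6). A throwaway decoder over the two gpte values, for illustration only:

/* Throwaway decoder for the gpte values in the trace above, using the
 * architectural x86 pte flag bits (P=0, R/W=1, U/S=2, A=5, D=6). */
#include <stdio.h>

static void decode(unsigned long long gpte)
{
    printf("%#010llx: frame %#07llx  P=%llu RW=%llu US=%llu A=%llu D=%llu\n",
           gpte, gpte >> 12,
           (gpte >> 0) & 1, (gpte >> 1) & 1, (gpte >> 2) & 1,
           (gpte >> 5) & 1, (gpte >> 6) & 1);
}

int main(void)
{
    decode(0x3d55d047ULL);  /* written through fffb61d0: accessed bit clear */
    decode(0x3d598363ULL);  /* written at c0009db4: A set, US clear, and the
                               frame (0x3d598) is the page holding the
                               previous pte_write gpa (the kmap_atomic()
                               mapping) */
    return 0;
}

The second value's frame, 0x3d598, is the page containing the previous
pte_write gpa (0x3d5981d0), which is what points at kmap_atomic().
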
>>> I suspected this some time ago, and tried tweaking the kscand_work_percent
>>> sysctl variable. It appeared to lower the peak of the spikes, but maybe I
>>> imagined it. I believe lowering that value makes kscand wake up more often
>>> but do less work (page scanning) each time it is awakened.
>>>
>> What does 'top' in the guest show (perhaps sorted by total cpu time rather
>> than instantaneous usage)?
>>
>> What host kernel are you running? How many host cpus?
>>

--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.