On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> >   user   nice   system   irq    softirq   guest   idle    iowait
> >   5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
> >   total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> >   user   nice   system   irq    softirq   guest   idle    iowait
> >   5.81   0.00   9.48     0.08   1.04      21.32   57.86   4.41
> >   total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%.
>
> That was an easy fruit to pick. Too bad it was a regression that we
> introduced.
>
> > new oprofile:
> >
> >> samples  %        app name              symbol name
> >> 876648   54.1555  kvm-intel.ko          vmx_vcpu_run
> >> 37595    2.3225   qemu-system-x86_64    cpu_physical_memory_rw
> >> 35623    2.2006   qemu-system-x86_64    phys_page_find_alloc
> >> 24874    1.5366   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_write_msr_safe
> >> 17710    1.0940   libc-2.5.so           memcpy
> >> 14664    0.9059   kvm.ko                kvm_arch_vcpu_ioctl_run
> >> 14577    0.9005   qemu-system-x86_64    qemu_get_ram_ptr
> >> 12528    0.7739   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_read_msr_safe
> >> 10979    0.6782   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  copy_user_generic_string
> >> 9979     0.6165   qemu-system-x86_64    virtqueue_get_head
> >> 9371     0.5789   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  schedule
> >> 8333     0.5148   qemu-system-x86_64    virtqueue_avail_bytes
> >> 7899     0.4880   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fget_light
> >> 7289     0.4503   qemu-system-x86_64    main_loop_wait
> >> 7217     0.4458   qemu-system-x86_64    lduw_phys
>
> This is almost entirely host virtio. I can reduce native_write_msr_safe
> by a bit, but not much.
> >> 6821     0.4214   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_exit
> >> 6749     0.4169   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  do_select
> >> 5919     0.3657   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_entry
> >> 5466     0.3377   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  kfree
> >> 4887     0.3019   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fput
> >> 4689     0.2897   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __switch_to
> >> 4636     0.2864   vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  mwait_idle
>
> Still not idle=poll, it may shave off 0.2%.
Won't this affect SMT in a negative way? (OK, I am not running SMT now,
but eventually we will be.) A long time ago we tested P4s with HT, and a
polling idle in one thread always negatively impacted performance in the
sibling thread. FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads). I have
been running into a problem with the virtio storage driver for Windows on
anything much newer than kvm-87. I compiled the driver from the new git
tree and it installed OK, but I still had the same error. Finally, I
removed the serial number feature from virtio-blk in qemu, and I can now
get the driver to work in Windows.

So, not really any good news on performance with the latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
  user   nice   system   irq    softirq   guest   idle    iowait
  5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
  total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
  user   nice   system   irq    softirq   guest   idle    iowait
  6.57   0.00   10.86    0.08   1.02      21.35   55.90   4.21
  total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
  user   nice   system   irq    softirq   guest   idle    iowait
  6.60   0.00   10.91    0.09   1.03      21.35   55.71   4.31
  total busy: 39.98

Diff of profiles, p1=qemu-kvm-87, p2=qemu-master:

> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> unit mask of 0x00 (No unit mask) count 10000000
> total samples (ts1) for profile1 is 1616921
> total samples (ts2) for profile2 is 1752347 (includes multiplier of 0.995420)
> functions which have a abs(pct2-pct1) < 0.06 are not displayed
>
>                                   pct2:    pct1:
>                                   100*     100*     pct2
>  s1         s2         s2/s1      s2/ts1   s1/ts1   -pct1   symbol                bin
> ---------  ---------  ---------  -------  -------  ------  --------------------  ---
>    879611     907883     1.03/1   56.149   54.400   1.749  vmx_vcpu_run          kvm
>       614      11553    18.82/1    0.715    0.038   0.677  gfn_to_memslot_unali  kvm.ko
>     34511      44922     1.30/1    2.778    2.134   0.644  phys_page_find_alloc  qemu
>      2866       9334     3.26/1    0.577    0.177   0.400  paging64_walk_addr    kvm.ko
>     11139      17200     1.54/1    1.064    0.689   0.375  copy_user_generic_st  vmlinux
>      3100       7108     2.29/1    0.440    0.192   0.248  x86_decode_insn       kvm.ko
>      8169      11873     1.45/1    0.734    0.505   0.229  virtqueue_avail_byte  qemu
>      1103       4540     4.12/1    0.281    0.068   0.213  kvm_read_guest        kvm.ko
>     17427      20401     1.17/1    1.262    1.078   0.184  memcpy                libc
>         0       2905               0.180    0.000   0.180  gfn_to_pfn            kvm.ko
>      1831       4328     2.36/1    0.268    0.113   0.154  x86_emulate_insn      kvm.ko
>        65       2431    37.41/1    0.150    0.004   0.146  emulator_read_emulat  kvm.ko
>     14922      17196     1.15/1    1.064    0.923   0.141  qemu_get_ram_ptr      qemu
>       545       2724     5.00/1    0.168    0.034   0.135  emulate_instruction   kvm.ko
>       599       2464     4.11/1    0.152    0.037   0.115  kvm_read_guest_page   kvm.ko
>       503       2355     4.68/1    0.146    0.031   0.115  gfn_to_hva            kvm.ko
>      1076       2918     2.71/1    0.181    0.067   0.114  memcpy_c              vmlinux
>       594       2241     3.77/1    0.139    0.037   0.102  next_segment          kvm.ko
>      1680       3248     1.93/1    0.201    0.104   0.097  pipe_poll             vmlinux
>         0       1463               0.090    0.000   0.090  subpage_readl         qemu
>         0       1363               0.084    0.000   0.084  msix_enabled          qemu
>       527       1883     3.57/1    0.116    0.033   0.084  paging64_gpte_to_gfn  kvm.ko
>       962       2223     2.31/1    0.138    0.059   0.078  do_insn_fetch         kvm.ko
>       348       1605     4.61/1    0.099    0.022   0.078  is_rsvd_bits_set      kvm.ko
>       520       1763     3.39/1    0.109    0.032   0.077  unalias_gfn           kvm.ko
>         1       1163  1163.65/1    0.072    0.000   0.072  tdp_page_fault        kvm.ko
>      3827       4912     1.28/1    0.304    0.237   0.067  __down_read           vmlinux
>         0       1014               0.063    0.000   0.063  mapping_level         kvm.ko
>       973          0               0.000    0.060  -0.060  pm_ioport_readl       qemu
>      1635        528     1/3.09    0.033    0.101  -0.068  ioport_read           qemu
>      2179       1017     1/2.14    0.063    0.135  -0.072  kvm_emulate_pio       kvm.ko
>     25141      23722     1/1.06    1.467    1.555  -0.088  native_write_msr_saf  vmlinux
>      1560          0               0.000    0.096  -0.096  eventfd_poll          vmlinux
>                                  -------  -------  ------
>                                  105.100   97.450   7.650

18x more samples for gfn_to_memslot_unali*, 37x for emulator_read_emula*,
and more CPU time in guest mode.

One other thing I decided to try was some cpu binding.
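As a sanity check on the derived numbers above: "total busy" is just 100
minus the idle and iowait columns, and both pct columns in the diff table
are normalized to ts1, the qemu-kvm-87 sample total (which is why the
pct2 column can total 105.100). A quick standalone sketch with values
copied from this mail; this is my own arithmetic, not code from oprofile
or the stat-collection tool:

```python
# Reproduce the derived figures quoted above (values copied from the mail).

def total_busy(user, nice, system, irq, softirq, guest, idle, iowait):
    """'total busy' = everything except idle and iowait."""
    return round(100.0 - idle - iowait, 2)

# qemu-kvm-88-905-g6025b2d (master) row:
print(total_busy(6.57, 0.00, 10.86, 0.08, 1.02, 21.35, 55.90, 4.21))  # 39.89

# Profile-diff percentages, vmx_vcpu_run row.  Note both columns divide
# by ts1, so pct2 for the (larger) second profile can sum past 100.
ts1 = 1616921
s1, s2 = 879611, 907883
pct1 = 100.0 * s1 / ts1          # ~54.400
pct2 = 100.0 * s2 / ts1          # ~56.149
print(round(pct1, 3), round(pct2, 3), round(s2 / s1, 2))  # ratio ~1.03/1
```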
I know this is not practical for production, but I wanted to see if
there's any benefit at all. One reason was that a coworker here tried
binding the qemu vcpu thread and the qemu IO thread to the same cpu; on
a guest-to-localhost networking test, throughput was up about 2x.
Obviously there was a nice effect from being on the same cache. I
wondered whether, even without full-bore throughput tests, we could see
any benefit here. So, I bound each pair of VMs to a dedicated core and
saw about a 6% improvement in performance. For a system which has pretty
incredible memory performance and is not that busy, I was surprised that
I got 6%. I am not advocating binding, but what I do wonder is: for 1-way
VMs, if we kept all of a VM's qemu threads together on the same CPU,
while still allowing the scheduler to move them (all at once) to a
different cpu over time, would we see the same benefit?

One other thing: so far I have not been using preadv/pwritev. I assume I
need a more recent glibc (on 2.5 now) for qemu to take advantage of this?

Thanks!

-Andrew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
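[On the preadv/pwritev question above: yes, a newer glibc is needed; the
wrappers first appeared in glibc 2.10 (the syscalls landed in Linux
2.6.30), so a qemu built against glibc 2.5 won't detect or use them. The
call's semantics can be sketched with Python's os.preadv wrapper (Linux,
Python 3.7+), used here purely as a stand-in for the C API: one syscall
scatters a read at a file offset across several buffers, which is what
lets qemu submit a guest's scatter-gather list without a bounce buffer.]

```python
# Illustrate preadv() semantics via Python's os.preadv wrapper; the
# glibc >= 2.10 C function behaves the same way with a struct iovec[].
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"0123456789abcdef")
    f.flush()
    b1, b2 = bytearray(4), bytearray(4)
    # One syscall: read 8 bytes starting at offset 2, split across
    # the two buffers in order.
    n = os.preadv(f.fileno(), [b1, b2], 2)
    print(n, bytes(b1), bytes(b2))  # 8 b'2345' b'6789'
```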