On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> > user  nice  system   irq  softirq guest   idle  iowait
> > 5.79  0.00    9.28  0.08     1.00 20.81  58.78    4.26
> > total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> > user  nice  system   irq  softirq guest   idle  iowait
> > 5.81  0.00    9.48  0.08    1.04  21.32  57.86    4.41
> > total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%
> >    
> 
> That was an easy fruit to pick.  Too bad it was a regression that we 
> introduced.
> 
> > new oprofile:
> >
> >    
> >> samples  %        app name                 symbol name
> >> 876648   54.1555  kvm-intel.ko             vmx_vcpu_run
> >> 37595     2.3225  qemu-system-x86_64       cpu_physical_memory_rw
> >> 35623     2.2006  qemu-system-x86_64       phys_page_find_alloc
> >> 24874     1.5366  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 native_write_msr_safe
> >> 17710     1.0940  libc-2.5.so              memcpy
> >> 14664     0.9059  kvm.ko                   kvm_arch_vcpu_ioctl_run
> >> 14577     0.9005  qemu-system-x86_64       qemu_get_ram_ptr
> >> 12528     0.7739  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 native_read_msr_safe
> >> 10979     0.6782  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 copy_user_generic_string
> >> 9979      0.6165  qemu-system-x86_64       virtqueue_get_head
> >> 9371      0.5789  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 schedule
> >> 8333      0.5148  qemu-system-x86_64       virtqueue_avail_bytes
> >> 7899      0.4880  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fget_light
> >> 7289      0.4503  qemu-system-x86_64       main_loop_wait
> >> 7217      0.4458  qemu-system-x86_64       lduw_phys
> >>      
> 
> This is almost entirely host virtio.  I can reduce native_write_msr_safe 
> by a bit, but not much.
> 
> >> 6821      0.4214  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 audit_syscall_exit
> >> 6749      0.4169  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 do_select
> >> 5919      0.3657  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 audit_syscall_entry
> >> 5466      0.3377  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 kfree
> >> 4887      0.3019  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fput
> >> 4689      0.2897  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __switch_to
> >> 4636      0.2864  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 mwait_idle
> >>      
> 
> Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way?  (OK, I am not running SMT now,
but eventually we will be.)  A long time ago we tested P4s with HT, and
a polling idle in one thread always negatively impacted performance in
the sibling thread.

FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads).  I
have been running into a problem with the virtio-stor driver for Windows
on anything much newer than kvm-87.  I compiled the driver from the new
git tree and it installed OK, but I still had the same error.  Finally,
I removed the serial number feature from virtio-blk in qemu, and I can
now get the driver to work in Windows.

So, not really any good news on performance with the latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user  nice  system   irq  softirq guest   idle  iowait
5.79  0.00    9.28  0.08     1.00 20.81  58.78    4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user  nice  system   irq  softirq guest   idle  iowait
6.57  0.00   10.86  0.08     1.02 21.35  55.90    4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user  nice  system   irq  softirq guest   idle  iowait
6.60  0.00  10.91   0.09     1.03 21.35  55.71    4.31
total busy: 39.98
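
For reference, "total busy" above is just the sum of the non-idle,
non-iowait columns, e.g. for the qemu-kvm-87 run (the small difference
from 36.97 is rounding in the per-column values):

# total busy = everything except idle and iowait (qemu-kvm-87 numbers above)
breakdown = {
    "user": 5.79, "nice": 0.00, "system": 9.28, "irq": 0.08,
    "softirq": 1.00, "guest": 20.81, "idle": 58.78, "iowait": 4.26,
}
total_busy = sum(v for k, v in breakdown.items()
                 if k not in ("idle", "iowait"))
print("total busy: %.2f" % total_busy)   # -> 36.96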

diff of profiles, p1=qemu-kvm-87, p2=qemu-master


> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit 
> mask of 0x00 (No unit mask) count 10000000
> total samples (ts1) for profile1 is 1616921 
> total samples (ts2) for profile2 is 1752347 (includes multiplier of 0.995420)
> functions which have an abs(pct2-pct1) < 0.06 are not displayed
> 
>                                   pct2:   pct1:
>                                    100*    100*   pct2
>        s1        s2     s2/s1   s2/ts1  s1/ts1  -pct1  symbol               bin
> --------- --------- ---------  ------- ------- ------  -------------------- -------
>    879611    907883    1.03/1   56.149  54.400  1.749  vmx_vcpu_run         kvm
>       614     11553   18.82/1    0.715   0.038  0.677  gfn_to_memslot_unali kvm.ko
>     34511     44922    1.30/1    2.778   2.134  0.644  phys_page_find_alloc qemu
>      2866      9334    3.26/1    0.577   0.177  0.400  paging64_walk_addr   kvm.ko
>     11139     17200    1.54/1    1.064   0.689  0.375  copy_user_generic_st vmlinux
>      3100      7108    2.29/1    0.440   0.192  0.248  x86_decode_insn      kvm.ko
>      8169     11873    1.45/1    0.734   0.505  0.229  virtqueue_avail_byte qemu
>      1103      4540    4.12/1    0.281   0.068  0.213  kvm_read_guest       kvm.ko
>     17427     20401    1.17/1    1.262   1.078  0.184  memcpy               libc
>         0      2905              0.180   0.000  0.180  gfn_to_pfn           kvm.ko
>      1831      4328    2.36/1    0.268   0.113  0.154  x86_emulate_insn     kvm.ko
>        65      2431   37.41/1    0.150   0.004  0.146  emulator_read_emulat kvm.ko
>     14922     17196    1.15/1    1.064   0.923  0.141  qemu_get_ram_ptr     qemu
>       545      2724    5.00/1    0.168   0.034  0.135  emulate_instruction  kvm.ko
>       599      2464    4.11/1    0.152   0.037  0.115  kvm_read_guest_page  kvm.ko
>       503      2355    4.68/1    0.146   0.031  0.115  gfn_to_hva           kvm.ko
>      1076      2918    2.71/1    0.181   0.067  0.114  memcpy_c             vmlinux
>       594      2241    3.77/1    0.139   0.037  0.102  next_segment         kvm.ko
>      1680      3248    1.93/1    0.201   0.104  0.097  pipe_poll            vmlinux
>         0      1463              0.090   0.000  0.090  subpage_readl        qemu
>         0      1363              0.084   0.000  0.084  msix_enabled         qemu
>       527      1883    3.57/1    0.116   0.033  0.084  paging64_gpte_to_gfn kvm.ko
>       962      2223    2.31/1    0.138   0.059  0.078  do_insn_fetch        kvm.ko
>       348      1605    4.61/1    0.099   0.022  0.078  is_rsvd_bits_set     kvm.ko
>       520      1763    3.39/1    0.109   0.032  0.077  unalias_gfn          kvm.ko
>         1      1163 1163.65/1    0.072   0.000  0.072  tdp_page_fault       kvm.ko
>      3827      4912    1.28/1    0.304   0.237  0.067  __down_read          vmlinux
>         0      1014              0.063   0.000  0.063  mapping_level        kvm.ko
>       973         0              0.000   0.060 -0.060  pm_ioport_readl      qemu
>      1635       528    1/3.09    0.033   0.101 -0.068  ioport_read          qemu
>      2179      1017    1/2.14    0.063   0.135 -0.072  kvm_emulate_pio      kvm.ko
>     25141     23722    1/1.06    1.467   1.555 -0.088  native_write_msr_saf vmlinux
>      1560         0              0.000   0.096 -0.096  eventfd_poll         vmlinux
>                                 ------- ------- ------
>                                 105.100  97.450  7.650


18x more samples for gfn_to_memslot_unali*, 37x for
emulator_read_emula*, and more CPU time in guest mode.
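
In case anyone wants to reproduce the comparison, it boils down to
roughly this (simplified sketch, not the exact script I used; the
0.995420 multiplier is folded into the profile2 sample counts here, and
both pct columns are relative to profile1's total, ts1):

# per-symbol oprofile diff, simplified
def diff_profiles(samples1, samples2, multiplier, threshold=0.06):
    """samples1/samples2 map symbol name -> raw sample count."""
    ts1 = float(sum(samples1.values()))
    rows = []
    for sym in set(samples1) | set(samples2):
        s1 = samples1.get(sym, 0)
        s2 = samples2.get(sym, 0) * multiplier
        pct1 = 100.0 * s1 / ts1
        pct2 = 100.0 * s2 / ts1
        if abs(pct2 - pct1) < threshold:    # "not displayed" cutoff above
            continue
        ratio = (s2 / s1) if s1 else None   # e.g. 18.82/1 for gfn_to_memslot
        rows.append((sym, s1, s2, ratio, pct2, pct1, pct2 - pct1))
    # biggest regressions (pct2 - pct1) first, like the listing above
    return sorted(rows, key=lambda r: r[-1], reverse=True)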

One other thing I decided to try was some CPU binding.  I know this is
not practical for production, but I wanted to see if there's any benefit
at all.  One reason was that a coworker here tried binding the qemu
thread for the vcpu and the qemu IO thread to the same cpu.  On a
networking test, guest->localhost, throughput was up about 2x.
Obviously there was a nice effect from being on the same cache.  I
wondered whether, even without full-bore throughput tests, we could see
any benefit here.  So, I bound each pair of VMs to a dedicated core, and
saw about a 6% improvement in performance.  For a system which has
pretty incredible memory performance and is not that busy, I was
surprised I got 6%.  I am not advocating binding, but what I do wonder
is:  on 1-way VMs, if we keep all the qemu threads together on the same
CPU, while still allowing the scheduler to move them (all of them at
once) to different cpus over time, would we see the same benefit?
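
The binding itself amounts to setting the affinity of every qemu thread
for a VM to one core, roughly the sketch below (Python 3's
os.sched_setaffinity(); the pids and core number are placeholders, and
taskset -cp on each tid does the same job):

import os

def pin_vm_threads(qemu_pid, cpu):
    # Pin every task (vcpu + IO threads) of one qemu process to a single
    # core; equivalent to `taskset -cp <cpu> <tid>` for each tid.
    for tid in os.listdir("/proc/%d/task" % qemu_pid):
        os.sched_setaffinity(int(tid), {cpu})

# e.g. put both VMs of a pair on core 3 (pids are placeholders)
# pin_vm_threads(12345, 3)
# pin_vm_threads(12346, 3)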

One other thing:  so far I have not been using preadv/pwritev.  I assume
I need a more recent glibc (I'm on 2.5 now) for qemu to take advantage
of this?
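
(A quick ctypes check like the sketch below at least tells me whether
the glibc on the box exports the wrappers at all; I assume qemu's
configure probes for them on its own.)

# sanity check: does the installed glibc export preadv/pwritev?
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
for sym in ("preadv", "pwritev"):
    print("%s: %s" % (sym, "found" if hasattr(libc, sym) else "missing"))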

Thanks!

-Andrew

