On 11/10/2014 02:15 PM, Paolo Bonzini wrote:

On 10/11/2014 11:45, Gleb Natapov wrote:
I also tried making the other shared MSRs (STAR, LSTAR, CSTAR, SYSCALL_MASK)
the same between guest and host, so that the user return notifier
has nothing to do.  That saves about 400-500 cycles on inl_from_qemu.  I
do want to dig out my old Core 2 and see how the new test fares, but it
really looks like your patch will be in 3.19.
Please test on a wide variety of HW before the final decision.
Yes, definitely.

Also it would
be nice to ask Intel what the expected overhead is. It is awesome if they
managed to add EFER switching with non-measurable overhead, but also hard
to believe :)
So let's see what happens.  Sneak preview: the result is definitely worth
asking Intel about.

I ran these benchmarks with a stock 3.16.6 KVM; instead of patching KVM,
I patched kvm-unit-tests to set EFER.SCE in enable_nx, so that the guest's
EFER matches the host's.  This makes it much simpler for others to
reproduce the results.  I only ran the inl_from_qemu test.
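
The change to the test is essentially a one-liner.  A minimal sketch of the
idea is below; the constants and the rdmsr/wrmsr helpers are spelled out here
for clarity, but kvm-unit-tests has its own equivalents, so don't take the
names or the exact hook as the actual patch:

/* Hypothetical sketch: make the guest EFER match the host EFER by also
 * setting EFER.SCE at the point where the test already enables EFER.NX. */
#define MSR_EFER   0xc0000080
#define EFER_SCE   (1ULL << 0)    /* SYSCALL enable */
#define EFER_NX    (1ULL << 11)   /* no-execute enable */

static inline unsigned long long rdmsr(unsigned int idx)
{
	unsigned int lo, hi;
	asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(idx));
	return lo | ((unsigned long long)hi << 32);
}

static inline void wrmsr(unsigned int idx, unsigned long long val)
{
	asm volatile("wrmsr" : : "c"(idx), "a"((unsigned int)val),
		     "d"((unsigned int)(val >> 32)) : "memory");
}

static void enable_nx(void *junk)
{
	/* upstream only sets NX here; for this experiment also set SCE */
	wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX | EFER_SCE);
}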

Perf stat reports that the processor goes from 0.46 to 0.66
instructions per cycle, which is consistent with the improvement from
19k to 12k cycles per iteration.
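
(For reference, the counters below come from plain perf stat around the test
run.  The exact invocation isn't reproduced here, but it was along the lines
of

    perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
        ./x86-run x86/vmexit.flat -append 'inl_from_qemu'

where ./x86-run stands in for however you normally launch kvm-unit-tests;
treat the command line as an assumption, not a copy of what produced the
numbers.)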

Unpatched KVM-unit-tests:

      3,385,586,563 cycles                    #    3.189 GHz                     [83.25%]
      2,475,979,685 stalled-cycles-frontend   #   73.13% frontend cycles idle    [83.37%]
      2,083,556,270 stalled-cycles-backend    #   61.54% backend  cycles idle    [66.71%]
      1,573,854,041 instructions              #    0.46  insns per cycle
                                              #    1.57  stalled cycles per insn [83.20%]
        1.108486526 seconds time elapsed


Patched KVM-unit-tests:

      3,252,297,378 cycles                    #    3.147 GHz                     [83.32%]
      2,010,266,184 stalled-cycles-frontend   #   61.81% frontend cycles idle    [83.36%]
      1,560,371,769 stalled-cycles-backend    #   47.98% backend  cycles idle    [66.51%]
      2,133,698,018 instructions              #    0.66  insns per cycle
                                              #    0.94  stalled cycles per insn [83.45%]
        1.072395697 seconds time elapsed

Playing with other events shows that the unpatched benchmark has an
awful load of TLB misses

Unpatched:

             30,311 iTLB-loads
        464,641,844 dTLB-loads
         10,813,839 dTLB-load-misses          #    2.33% of all dTLB cache hits
         20,436,027 iTLB-load-misses          #  67421.16% of all iTLB cache hits

Patched:

          1,440,033 iTLB-loads
        640,970,836 dTLB-loads
          2,345,112 dTLB-load-misses          #    0.37% of all dTLB cache hits
            270,884 iTLB-load-misses          #   18.81% of all iTLB cache hits
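
(These are the generic iTLB-*/dTLB-* events, i.e. roughly "perf stat -e
iTLB-loads,iTLB-load-misses,dTLB-loads,dTLB-load-misses" around the same
runs; the exact invocation is an assumption.)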

This is 100% reproducible.  The meaning of the numbers is clearer if you
look up the raw event numbers in the Intel manuals:

- iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB
  [second-level TLB] hits. No page walk."

- iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that
cause page walks."
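
(In perf's raw rXXXX syntax the umask is the upper byte and the event code
the lower byte, so both counters can be gathered directly with something like
"perf stat -e r1085,r185 -a -- sleep 1" while the benchmark is looping; this
exact command is an assumption, not what produced the numbers above.)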

So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and
friends show that the unpatched run wastes about 0.1 seconds more than
the patched run on page walks:

Unpatched:

         24,430,676 r449             (cycles on dTLB store miss page walks)
        196,017,693 r408             (cycles on dTLB load miss page walks)
        213,266,243 r485             (cycles on iTLB miss page walks)
-------------------------
        433,714,612 total

Patched:

         22,583,440 r449             (cycles on dTLB store miss page walks)
         40,452,018 r408             (cycles on dTLB load miss page walks)
          2,115,981 r485             (cycles on iTLB miss page walks)
-------------------------
         65,151,439 total
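
(The difference is 433,714,612 - 65,151,439, about 369 million cycles, which
at ~3.15 GHz is roughly 0.12 seconds; hence the "about 0.1 seconds" above.)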

These 0.1 seconds are probably all spent on instructions that would otherwise
have been fast, since the slow instructions responsible for the low IPC are
the microcoded ones (VMX instructions and other privileged stuff).

Similarly, BDh/20h counts STLB flushes, which are about 260k in the unpatched
run and 3k in the patched run.  Let's see where they come from:
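
(The profiles below were presumably gathered by sampling on that flush event,
e.g. "perf record -e r20bd -a" while the test runs, followed by perf report;
the command is an assumption, only the output comes from the original runs.)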

Unpatched:

+  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault

It's expected that most TLB misses happen just before a page fault (there
are also events to count how many TLB misses do result in a page fault,
if you care about that), and thus are accounted to the first instruction of the
exception handler.

We do not know what causes second-level TLB _flushes_, but it's quite
expected that you'll have a TLB miss after them and possibly a page fault.
And anyway, 98.97% of them coming from native_write_msr_safe is totally
anomalous.

The patched benchmark shows that no second-level TLB flushes occur after a WRMSR:

+  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
+   9.07%  qemu-kvm  [kvm_intel]        [k] vmx_flush_tlb
+   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
+   5.68%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_mm_range
+   4.87%  qemu-kvm  [kernel.kallsyms]  [k] native_flush_tlb
+   1.36%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_page


So basically EFER writes done as part of a VMX transition are optimized, while
non-VMX EFER writes (plain WRMSR) cause a TLB flush, at least on Sandy Bridge.
Ouch!
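
In other words, the trade-off looks roughly like this; the following is only
an illustration of the measurement, not the actual kvm code, and every name
in it is made up:

/* Illustration only: three ways EFER can be handled across a VM entry,
 * and what the numbers above say about their cost on this machine. */
typedef unsigned long long u64;

static int cpu_has_load_ia32_efer;	/* assumed capability flag */

static void switch_efer(u64 guest_efer, u64 host_efer)
{
	if (guest_efer == host_efer) {
		/* nothing to switch: no VMCS EFER load, no deferred WRMSR
		 * from the user return notifier, and hence no TLB flush */
	} else if (cpu_has_load_ia32_efer) {
		/* load EFER through the VM entry/exit controls: this is
		 * the "VMX EFER write" that does not flush the STLB */
	} else {
		/* plain WRMSR (e.g. from the user return notifier): on
		 * Sandy Bridge every such write flushes the TLB */
	}
}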


It's not surprising [1].  Since the meaning of some PTE bits changes when
NXE is toggled [2], the TLB has to be flushed.  In VMX we have VPIDs, so we
only need to flush if EFER changed between two invocations of the same VPID,
which isn't the case.

[1] after the fact
[2] although those bits were reserved with NXE=0, so they shouldn't have any TLB footprint