[RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V

Jeremi Piotrowski Fri, 20 Jun 2025 11:12:10 -0700

Hi Sean/Parolo/Vitaly,

Wanted to get your opinion on this change. Let me first introduce the scenario:


We have been testing kata containers (containers wrapped in VM) in Azure, and
found some significant issues with TLB flushing. This is a popular workload and
requires launching many nested VMs quickly. When testing on a 64 core Intel VM
(D64s_v5 in case someone is wondering), spinning up some 150-ish nested VMs in
parallel, performance starts getting worse the more nested VMs are already
running, CPU usage spikes to 100% on all cores and doesn't settle even when all
nested VMs boot up. On an idle system a single nested VMs boots within seconds,
but once we have a couple dozen running or so (doing nothing inside), boot time
gets longer and longer for each new nested VM, they start hitting startup
timeout etc. In some cases we never reach the point where all nested VMs are
up and running.

Investigating the issue we found that this can't be reproduced on AMD and on
Intel when EPT is disabled. In both these cases the scenario completes within
20s or so. TPD_MMU or not doesn't make a difference. With EPT=Y the case takes
minutes.Out of curiousity I also ran the test case on an n4-standard-64 VM on
GCP and found that EPT=Y runs in ~30s, while EPT=N runs in ~20s (which I found
slightly interesting).

So that's when we starting looking at the TLB flushing code and found that
INVEPT.global is used on every CPU migration and that it's an expensive
function on Hyper-V. It also has an impact on every running nested VM, so we
end up with lots of INVEPT.global calls - we reach 2000 calls/s before we're
essentially stuck in 100% guest ttime.  That's why I'm looking at tweaking the
TLB flushing behavior to avoid it. I came across past discussions on this topic
([^1]) and after some thinking see two options:

1. Do you see a way to optimize this generically to avoid KVM_REQ_TLB_FLUSH on
migration in current KVM? In nested (as in: KVM running nested) I think we
rarely see CPU pinning used the way we it is on baremetal so it's not a rare of
an operation. Much has also changed since [^1] and with kvm_mmu_reset_context()
still being called in many paths we might be over flushing. Perhaps a loop
flushing individual roots with roles that do not have a post_set_xxx hook that
does flushing?

2. We can approach this in a Hyper-V specific way, using the dedicated flush
hypercall, which is what the following RFC patch does. This hypercall acts as a
broadcast INVEPT.single. I believe that using the flush hypercall in
flush_tlb_current() is sufficient to ensure the right semantics and correctness.
The one thing I haven't made up my mind about yet is whether we could still use
a flush of the current root on migration or not - I can imagine at most an
INVEPT.single, I also haven't yet figured out how that could be plumbed in if
it's really necessary (can't put it in KVM_REQ_TLB_FLUSH because that would
break the assumption that it is stronger than KVM_REQ_TLB_FLUSH_CURRENT).

With 2. the performance is comparable to EPT=N on Intel, roughly 20s for the
test scenario.

Let me know what you think about this and if you have any suggestions.

Best wishes,
Jeremi

[^1]: https://lore.kernel.org/kvm/yqljnbbp%2feous...@google.com/

Jeremi Piotrowski (1):
  KVM: VMX: Use Hyper-V EPT flush for local TLB flushes

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/vmx.c          | 20 +++++++++++++++++---
 arch/x86/kvm/vmx/vmx_onhyperv.h |  6 ++++++
 arch/x86/kvm/x86.c              |  3 +++
 4 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.39.5

[RFC PATCH 0/1] Tweak TLB flushing when VMX is running on Hyper-V

Reply via email to