On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote:
> On Tue, Aug 05, 2025, Jeremi Piotrowski wrote:
> > On 05/08/2025 01:09, Sean Christopherson wrote:
> > > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> > >> Sean Christopherson <sea...@google.com> writes:

(snip)

> > > 
> > > Oof, right.  And it's not even a VM-to-VM noisy neighbor problem, e.g. a
> > > few vCPUs using nested TDP could generate a lot of noisy IRQs through a
> > > VM.  Hrm.
> > > 
> > > So I think the basic idea is so flawed/garbage that even enhancing it
> > > with per-VM pCPU tracking wouldn't work.  I do think you've got the
> > > right idea with a pCPU mask though, but instead of using a mask to
> > > scope IPIs, use it to elide TLB flushes.
> > 
> > Sorry for the delay in replying, I've been sidetracked a bit.
> 
> No worries, I guarantee my delays will make your delays pale in comparison :-D
> 
> > I like this idea more; not special-casing the TLB flushing approach per
> > hypervisor is preferable.
> > 
> > > 
> > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any
> > > given time: SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2.  Allocating
> > > a cpumask for each TDP MMU root seems reasonable.  Then on task
> > > migration, instead of doing a global INVEPT, only INVEPT the current and
> > > prev_roots (because getting a new root will trigger a flush in
> > > kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU has
> > > already done a flush for the root.
> > 
> > Just to make sure I follow: by current+prev_roots, do you mean literally
> > those (i.e. the cached prev roots) or all roots on kvm->arch.tdp_mmu_roots?
> 
> The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure.
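
(For reference, a rough sketch of the kvm_mmu fields in question, from
arch/x86/include/asm/kvm_host.h; KVM_MMU_NUM_PREV_ROOTS is 3 on the kernels
I've looked at:)

	struct kvm_mmu_root_info {
		gpa_t pgd;
		hpa_t hpa;
	};

	struct kvm_mmu {
		...
		/* The currently loaded root. */
		struct kvm_mmu_root_info root;
		/* Cached previous roots, reused by fast_pgd_switch(). */
		struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
		...
	};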
> 
> > So this would mean: on pCPU migration, check if the current mmu has
> > is_tdp_mmu_active() and then perform the INVEPT-single over roots instead
> > of INVEPT-global.  Otherwise stick to the KVM_REQ_TLB_FLUSH.
> 
> No, KVM would still need to ensure shadow roots are flushed as well, because
> KVM doesn't flush TLBs when switching to a previous root (see
> fast_pgd_switch()).  More at the bottom.
> 
> > Would there need to be a check for is_guest_mode(), or that the switch is
> > coming from vmx/nested.c? I suppose not, because nested doesn't seem to
> > use the TDP MMU.
> 
> Nested can use the TDP MMU, though there's practically no code in KVM that
> explicitly deals with this scenario.  If L1 is using legacy shadow paging,
> i.e. is NOT using EPT/NPT, then KVM will use the TDP MMU to map L2 (with
> kvm_mmu_page_role.guest_mode=1 to differentiate from the L1 TDP MMU).
> 
> > > Or we could do the optimized tracking for all roots.  x86 supports at
> > > most 8192 CPUs, which means 1KiB per root.  That doesn't seem at all
> > > painful given that each shadow page consumes 4KiB...
> > 
> > Similar question here: which roots would need to be tracked+flushed for
> > shadow paging? pae_roots?
> 
> Same general answer, "root" and all "prev_roots" entries.  KVM uses up to two
> "struct kvm_mmu" instances to actually map memory into the guest: root_mmu
> and guest_mmu.  The third instance, nested_mmu, is used to model gva->gpa
> translations for L2, i.e. is used only to walk L2 stage-1 page tables, and is
> never used to map memory into the guest, i.e. can't have entries in hardware
> TLBs.
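
(A sketch of those instances as I read them, from struct kvm_vcpu_arch in
arch/x86/include/asm/kvm_host.h:)

	struct kvm_vcpu_arch {
		...
		/* Points at root_mmu or guest_mmu, i.e. the active context. */
		struct kvm_mmu *mmu;
		/* Maps memory for L1 (and for L2 when L1 isn't using EPT/NPT). */
		struct kvm_mmu root_mmu;
		/* Maps memory for L2 when L1 is using EPT/NPT. */
		struct kvm_mmu guest_mmu;
		/* Models L2 gva->gpa walks only; never loaded into hardware. */
		struct kvm_mmu nested_mmu;
		...
	};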
> 
> The basic gist is to add a cpumask in each root, and then elide TLB flushes
> on pCPU migration if KVM has flushed the root at least once.  Patch 5/5 in
> the attached set of patches provides a *very* rough sketch.  Hopefully it's
> enough to convey the core idea.
> 
> Patches 1-4 compile, but are otherwise untested.  I'll post patches 1-3 as a
> small series once they're tested, as those cleanups are worth doing
> irrespective of any optimizations we make to pCPU migration.
> 

Thanks for the detailed explanation and the patches, Sean!
I started working on extending patch 5 and wanted to post it here to make sure
I'm on the right track.

It works in testing so far and shows promising performance - it gets rid of all
the pathological cases I saw before.

I haven't checked whether I broke SVM yet, and I need to figure out a way to
always keep the cpumask "offstack" so that we don't blow up every struct
kvm_mmu_page instance with an inline cpumask - it needs to stay optional.
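
(For context, cpumask_var_t is only a heap-allocated pointer when
CONFIG_CPUMASK_OFFSTACK=y; otherwise it embeds a full NR_CPUS-sized mask,
which is what would bloat every kvm_mmu_page.  From <linux/cpumask.h>,
roughly:)

	#ifdef CONFIG_CPUMASK_OFFSTACK
	/* Mask is allocated separately, via alloc_cpumask_var() and friends. */
	typedef struct cpumask *cpumask_var_t;
	#else
	/* Mask is embedded inline in the containing structure. */
	typedef struct cpumask cpumask_var_t[1];
	#endif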

I also came across kvm_mmu_is_dummy_root(); that check is included in
root_to_sp(). Can you think of any other checks that we might need to handle?
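
(For reference, root_to_sp() already bails on the dummy root; roughly, from
arch/x86/kvm/mmu/spte.h:)

	static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
	{
		if (kvm_mmu_is_dummy_root(root))
			return NULL;

		/*
		 * The "root" may be a special root, e.g. a PAE root; treat it
		 * as a SPTE to ensure any non-PA bits are dropped.
		 */
		return spte_to_child_sp(root);
	}
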
From 8fb6d18ad4cbdd1802df45be49358a6d6acf72a0 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sea...@google.com>
Date: Tue, 5 Aug 2025 15:58:13 -0700
Subject: [PATCH] KVM: VMX: Sketch in possible framework for eliding TLB
 flushes on pCPU migration

Not-Signed-off-by: Sean Christopherson <sea...@google.com>

(anyone that makes this work deserves full credit)

Not-yet-Signed-off-by: Jeremi Piotrowski <jpiotrow...@linux.microsoft.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 +++
 arch/x86/kvm/mmu/mmu.c             |  5 +++++
 arch/x86/kvm/mmu/mmu_internal.h    |  4 ++++
 arch/x86/kvm/mmu/tdp_mmu.c         |  4 ++++
 arch/x86/kvm/vmx/main.c            |  1 +
 arch/x86/kvm/vmx/vmx.c             | 28 +++++++++++++++++++++-------
 arch/x86/kvm/vmx/x86_ops.h         |  1 +
 8 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..60351dd22f2f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
 KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP_OPTIONAL(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(alloc_root_cpu_mask)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4a391929cdb..a3d415c3ea8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1801,6 +1801,9 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	/* Allocate per-root pCPU flush mask. */
+	void (*alloc_root_cpu_mask)(struct kvm_mmu_page *root);
+
 	/* Update external mapping with page table link. */
 	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				void *external_spt);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e06e2e89a8f..721ee8ea76bd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -20,6 +20,7 @@
 #include "ioapic.h"
 #include "mmu.h"
 #include "mmu_internal.h"
+#include <linux/cpumask.h>
 #include "tdp_mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
@@ -1820,6 +1821,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	free_page((unsigned long)sp->shadowed_translation);
+	free_cpumask_var(sp->cpu_flushed_mask);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -3827,6 +3829,9 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
 	++sp->root_count;
 
+	if (level >= PT64_ROOT_4LEVEL)
+		kvm_x86_call(alloc_root_cpu_mask)(sp);
+
 	return __pa(sp->spt);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..5acb3dd34b36 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -7,6 +7,7 @@
 #include <asm/kvm_host.h>
 
 #include "mmu.h"
+#include <linux/cpumask.h>
 
 #ifdef CONFIG_KVM_PROVE_MMU
 #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
@@ -145,6 +146,9 @@ struct kvm_mmu_page {
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
 #endif
+
+	/* Mask tracking which host CPUs have flushed this EPT root */
+	cpumask_var_t cpu_flushed_mask;
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..40c7f46f553c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -3,6 +3,7 @@
 
 #include "mmu.h"
 #include "mmu_internal.h"
+#include <linux/cpumask.h>
 #include "mmutrace.h"
 #include "tdp_iter.h"
 #include "tdp_mmu.h"
@@ -57,6 +58,7 @@ static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->external_spt);
 	free_page((unsigned long)sp->spt);
+	free_cpumask_var(sp->cpu_flushed_mask);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -293,6 +295,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
 	root = tdp_mmu_alloc_sp(vcpu);
 	tdp_mmu_init_sp(root, NULL, 0, role);
 
+	kvm_x86_call(alloc_root_cpu_mask)(root);
+
 	/*
 	 * TDP MMU roots are kept until they are explicitly invalidated, either
 	 * by a memslot update or by the destruction of the VM.  Initialize the
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..ec7f6899443d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1005,6 +1005,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.write_tsc_multiplier = vt_op(write_tsc_multiplier),
 
 	.load_mmu_pgd = vt_op(load_mmu_pgd),
+	.alloc_root_cpu_mask = vmx_alloc_root_cpu_mask,
 
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index eec2d866e7f1..a6d93624c2d4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/tboot.h>
 #include <linux/trace_events.h>
+#include <linux/cpumask.h>
 #include <linux/entry-kvm.h>
 
 #include <asm/apic.h>
@@ -62,6 +63,7 @@
 #include "kvm_cache_regs.h"
 #include "lapic.h"
 #include "mmu.h"
+#include "mmu/spte.h"
 #include "nested.h"
 #include "pmu.h"
 #include "sgx.h"
@@ -1450,7 +1452,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu);
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu);
 
 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
 {
@@ -1489,8 +1491,8 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
 		 * TLB entries from its previous association with the vCPU.
 		 */
 		if (enable_ept) {
-			vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu);
-			vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu);
+			vmx_flush_ept_on_pcpu_migration(&vcpu->arch.root_mmu, cpu);
+			vmx_flush_ept_on_pcpu_migration(&vcpu->arch.guest_mmu, cpu);
 		} else {
 			kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		}
@@ -3307,22 +3309,34 @@ void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	vpid_sync_context(vmx_get_current_vpid(vcpu));
 }
 
-static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa)
+void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root)
 {
+	WARN_ON_ONCE(!zalloc_cpumask_var(&root->cpu_flushed_mask,
+					GFP_KERNEL_ACCOUNT));
+}
+
+static void __vmx_flush_ept_on_pcpu_migration(hpa_t root_hpa, int cpu)
+{
+	struct kvm_mmu_page *root;
+
 	if (!VALID_PAGE(root_hpa))
 		return;
 
+	root = root_to_sp(root_hpa);
+	if (!root || cpumask_test_and_set_cpu(cpu, root->cpu_flushed_mask))
+		return;
+
 	vmx_flush_tlb_ept_root(root_hpa);
 }
 
-static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu)
+static void vmx_flush_ept_on_pcpu_migration(struct kvm_mmu *mmu, int cpu)
 {
 	int i;
 
-	__vmx_flush_ept_on_pcpu_migration(mmu->root.hpa);
+	__vmx_flush_ept_on_pcpu_migration(mmu->root.hpa, cpu);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
-		__vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa);
+		__vmx_flush_ept_on_pcpu_migration(mmu->prev_roots[i].hpa, cpu);
 }
 
 void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..4406d53e6ebe 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -84,6 +84,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
 void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_alloc_root_cpu_mask(struct kvm_mmu_page *root);
 void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
 u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
 void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
-- 
2.39.5
