On 24/06/2026 18:46, Ackerley Tng wrote:
Sean Christopherson <[email protected]> writes:
On Fri, Jun 19, 2026, Fuad Tabba wrote:
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<[email protected]> wrote:
From: Ackerley Tng <[email protected]>
When memory in guest_memfd is converted from private to shared, the
platform-specific state associated with the guest-private pages must be
invalidated or cleaned up.
Iterate over the folios in the affected range and call the
kvm_arch_gmem_invalidate() hook for each PFN range. This allows
architectures to perform necessary teardown, such as updating hardware
metadata or encryption states, before the pages are transitioned to the
shared state.
Invoke this helper after indicating to KVM's mmu code that an invalidation
is in progress to stop in-flight page faults from succeeding.
Reviewed-by: Fuad Tabba <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Coming back to this after working through the arm64/pKVM side. My
Reviewed-by here is from the previous round and the patch hasn't
changed, but I missed an implication for arm64.
kvm_arch_gmem_invalidate() is now called from two paths with the same
(start, end) signature: folio teardown (kvm_gmem_free_folio) and
private->shared conversion (here). For SNP/TDX that's fine, conversion is
destructive anyway. For pKVM the two need opposite content semantics:
conversion must preserve the page in place (same physical page, the point
of in-place conversion without encryption), while teardown must scrub it
before returning it to the host.
The hook gets only a pfn range with no indication of which caller it's
serving, so arm64 can't give the two paths the behaviour they need. It
would help to signal intent on the conversion path: a reason/flag, a
separate hook, or not routing non-destructive conversion through the
teardown hook.
arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
second caller now, and it's cheaper to leave room for the distinction
than to change a generic contract other arches depend on later.
Crud. It may not be urgent for arm64, but it's urgent for other reasons that
I "can't" describe in detail at the moment, and even if that weren't the case, I
think we should clean things up now. More below.
virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 433f79047b9d1..3c94442bc8131 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode
*inode, pgoff_t start,
return safe;
}
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
+static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t
end)
Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed. It's not
"invalidating" anything, it's much more of a "free" callback, as SNP uses it to
put physical pages back into a shared state when a maybe-private folio is freed.
As Fuad points out, (ab)using that hook for the private=>shared conversion case
"works", but not broadly. And it makes the bad name worse, because it's called
from code that _is_ doing true invalidations. For pKVM, it may not even need to
do anything invalidation-like.
Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
especially when conversions also calls
kvm_gmem_invalidate_{start,end}() and those do different things.
To avoid a conflict with patches that are going to have priority over this
series,
to set the stage for arm64 support, and to avoid avoid bleeding vendor details
into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
"invalidation" on this specific transition), I think we should add an arch hook
to do conversions straightaway.
Unless there's a clever option I'm missing, it'll mean adding yet another
HAVE_KVM_ARCH_GMEM_XXX flag? Hmm, especially because IIUC, arm64/pKVM doesn't
need a callback for this case, only the free_folio case.
+{
+ struct folio_batch fbatch;
+ pgoff_t next = start;
+ int i;
+
+ folio_batch_init(&fbatch);
+ while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
+ for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+ struct folio *folio = fbatch.folios[i];
+ pgoff_t start_index, end_index;
+ kvm_pfn_t start_pfn, end_pfn;
+
+ start_index = max(start, folio->index);
+ end_index = min(end, folio_next_index(folio));
+ /*
+ * end_index is either in folio or points to
+ * the first page of the next folio. Hence,
+ * all pages in range [start_index, end_index)
+ * are contiguous.
+ */
+ start_pfn = folio_file_pfn(folio, start_index);
+ end_pfn = start_pfn + end_index - start_index;
+
+ kvm_arch_gmem_invalidate(start_pfn, end_pfn);
+ }
+
+ folio_batch_release(&fbatch);
+ cond_resched();
+ }
+}
+#else
+static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t
end) {}
+#endif
+
static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
size_t nr_pages, uint64_t attrs,
pgoff_t *err_index)
@@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode,
pgoff_t start,
*/
kvm_gmem_invalidate_start(inode, start, end);
+
+ if (!to_private)
+ kvm_gmem_invalidate(inode, start, end);
E.g. instead make this something like this?
kvm_gmem_set_pfn_attributes(...)
Hrm, though that wastes folio lookups in the to_private case. So maybe just
this,
assuming pKVM doesn't need to take additional action on conversions?
if (!to_private)
kvm_gmem_make_shared(...)
Actually, if we do that, then we don't need a separate arch hook, just a
separate
config. It'll still bleed SNP details into guest_memfd, but it'll at least be
done in a way that's more explicitly arch specific (and it's no different than
what we already do for PREPARE...).
pKVM needs some arch guest_memfd lifecycle functions that
+ for conversion, doesn't do anything,
+ for teardown, resets page state (IIUC it'll be reset to
PKVM_PAGE_OWNED (by the host))
So I think we need different functions for those two stages in the
lifecycle of a page with guest_memfd? What if we have
CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates
+ kvm_gmem_should_set_pfn_attributes(attributes) and
.gmem_should_set_pfn_attributes
+ kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
.gmem_set_pfn_attributes
CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates
+ kvm_gmem_teardown() and .gmem_teardown
SNP:
+ .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
and sev_gmem_should_set_pfn_attributes returns !is_private
+ Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
+ .gmem_teardown = sev_gmem_set_pfn_attributes
TDX:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
pKVM:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ .gmem_teardown = pkvm_gmem_set_pfn_attributes
Suzuki, does this work for ARM CCA?
Yep, that works for us. For CCA we would :
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
In the future we might utilise the gmem_set_pfn_attributes call back.
Thanks
Suzuki
This way,
+ The if (is_private) check doesn't leak SNP details into guest_memfd
+ .gmem_make_shared doesn't stick out without a .gmem_make_private
+ .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
conceptually as lifecycle hooks
+ I think the private/shared check for prepare can also be folded into
preparation.
+ Preparation perhaps doesn't need a should_prepare equivalent since
there's no iteration and getting the gfn is just doing some math?
+ In another patch series?
E.g. this? There will still be a looming rename conflict, but that's easy
enough
to handle.
diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
index 9ce5be7843f2..8aead0abd788 100644
--- virt/kvm/guest_memfd.c
+++ virt/kvm/guest_memfd.c
@@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode
*inode, pgoff_t start,
return safe;
}
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t
end)
+#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
+static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t
end)
{
struct folio_batch fbatch;
pgoff_t next = start;
@@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode,
pgoff_t start, pgoff_t end)
}
}
#else
-static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t
end) {}
+static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t
end) { }
#endif
static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
@@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode,
pgoff_t start,
kvm_gmem_invalidate_start(inode, start, end);
if (!to_private)
- kvm_gmem_invalidate(inode, start, end);
+ kvm_gmem_make_shared(inode, start, end);
mas_store_prealloc(&mas, xa_mk_value(attrs));