On Wed, 24 Jun 2026 at 18:46, Ackerley Tng <[email protected]> wrote:
>
> Sean Christopherson <[email protected]> writes:
>
> > On Fri, Jun 19, 2026, Fuad Tabba wrote:
> >> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> >> <[email protected]> wrote:
> >> >
> >> > From: Ackerley Tng <[email protected]>
> >> >
> >> > When memory in guest_memfd is converted from private to shared, the
> >> > platform-specific state associated with the guest-private pages must be
> >> > invalidated or cleaned up.
> >> >
> >> > Iterate over the folios in the affected range and call the
> >> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> >> > architectures to perform necessary teardown, such as updating hardware
> >> > metadata or encryption states, before the pages are transitioned to the
> >> > shared state.
> >> >
> >> > Invoke this helper after indicating to KVM's mmu code that an
> >> > invalidation
> >> > is in progress to stop in-flight page faults from succeeding.
> >> >
> >> > Reviewed-by: Fuad Tabba <[email protected]>
> >> > Signed-off-by: Ackerley Tng <[email protected]>
> >>
> >> Coming back to this after working through the arm64/pKVM side. My
> >> Reviewed-by here is from the previous round and the patch hasn't
> >> changed, but I missed an implication for arm64.
> >>
> >> kvm_arch_gmem_invalidate() is now called from two paths with the same
> >> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
> >> private->shared conversion (here). For SNP/TDX that's fine, conversion is
> >> destructive anyway. For pKVM the two need opposite content semantics:
> >> conversion must preserve the page in place (same physical page, the point
> >> of in-place conversion without encryption), while teardown must scrub it
> >> before returning it to the host.
> >>
> >> The hook gets only a pfn range with no indication of which caller it's
> >> serving, so arm64 can't give the two paths the behaviour they need. It
> >> would help to signal intent on the conversion path: a reason/flag, a
> >> separate hook, or not routing non-destructive conversion through the
> >> teardown hook.
> >>
> >> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
> >> second caller now, and it's cheaper to leave room for the distinction
> >> than to change a generic contract other arches depend on later.
> >
> > Crud. It may not be urgent for arm64, but it's urgent for other reasons
> > that
> > I "can't" describe in detail at the moment, and even if that weren't the
> > case, I
> > think we should clean things up now. More below.
> >
> >> > virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >> > 1 file changed, 41 insertions(+)
> >> >
> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> > index 433f79047b9d1..3c94442bc8131 100644
> >> > --- a/virt/kvm/guest_memfd.c
> >> > +++ b/virt/kvm/guest_memfd.c
> >> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct
> >> > inode *inode, pgoff_t start,
> >> > return safe;
> >> > }
> >> >
> >> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> >> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start,
> >> > pgoff_t end)
> >
> > Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed. It's not
> > "invalidating" anything, it's much more of a "free" callback, as SNP uses
> > it to
> > put physical pages back into a shared state when a maybe-private folio is
> > freed.
> >
> > As Fuad points out, (ab)using that hook for the private=>shared conversion
> > case
> > "works", but not broadly. And it makes the bad name worse, because it's
> > called
> > from code that _is_ doing true invalidations. For pKVM, it may not even
> > need to
> > do anything invalidation-like.
> >
>
> Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
> especially when conversions also calls
> kvm_gmem_invalidate_{start,end}() and those do different things.
>
> > To avoid a conflict with patches that are going to have priority over this
> > series,
> > to set the stage for arm64 support, and to avoid avoid bleeding vendor
> > details
> > into guest_memfd, as if they are core guest_memfd behavior (only SNP needs
> > the
> > "invalidation" on this specific transition), I think we should add an arch
> > hook
> > to do conversions straightaway.
> >
> > Unless there's a clever option I'm missing, it'll mean adding yet another
> > HAVE_KVM_ARCH_GMEM_XXX flag? Hmm, especially because IIUC, arm64/pKVM
> > doesn't
> > need a callback for this case, only the free_folio case.
> >
> >> > +{
> >> > + struct folio_batch fbatch;
> >> > + pgoff_t next = start;
> >> > + int i;
> >> > +
> >> > + folio_batch_init(&fbatch);
> >> > + while (filemap_get_folios(inode->i_mapping, &next, end - 1,
> >> > &fbatch)) {
> >> > + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> >> > + struct folio *folio = fbatch.folios[i];
> >> > + pgoff_t start_index, end_index;
> >> > + kvm_pfn_t start_pfn, end_pfn;
> >> > +
> >> > + start_index = max(start, folio->index);
> >> > + end_index = min(end, folio_next_index(folio));
> >> > + /*
> >> > + * end_index is either in folio or points to
> >> > + * the first page of the next folio. Hence,
> >> > + * all pages in range [start_index, end_index)
> >> > + * are contiguous.
> >> > + */
> >> > + start_pfn = folio_file_pfn(folio, start_index);
> >> > + end_pfn = start_pfn + end_index - start_index;
> >> > +
> >> > + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> >> > + }
> >> > +
> >> > + folio_batch_release(&fbatch);
> >> > + cond_resched();
> >> > + }
> >> > +}
> >> > +#else
> >> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start,
> >> > pgoff_t end) {}
> >> > +#endif
> >> > +
> >> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> >> > size_t nr_pages, uint64_t attrs,
> >> > pgoff_t *err_index)
> >> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode
> >> > *inode, pgoff_t start,
> >> > */
> >> >
> >> > kvm_gmem_invalidate_start(inode, start, end);
> >> > +
> >> > + if (!to_private)
> >> > + kvm_gmem_invalidate(inode, start, end);
> >
> > E.g. instead make this something like this?
> >
> > kvm_gmem_set_pfn_attributes(...)
> >
> > Hrm, though that wastes folio lookups in the to_private case. So maybe
> > just this,
> > assuming pKVM doesn't need to take additional action on conversions?
> >
> > if (!to_private)
> > kvm_gmem_make_shared(...)
> >
> > Actually, if we do that, then we don't need a separate arch hook, just a
> > separate
> > config. It'll still bleed SNP details into guest_memfd, but it'll at least
> > be
> > done in a way that's more explicitly arch specific (and it's no different
> > than
> > what we already do for PREPARE...).
> >
>
> pKVM needs some arch guest_memfd lifecycle functions that
>
> + for conversion, doesn't do anything,
> + for teardown, resets page state (IIUC it'll be reset to
> PKVM_PAGE_OWNED (by the host))
>
> So I think we need different functions for those two stages in the
> lifecycle of a page with guest_memfd? What if we have
Yes, the split is what I was after. One PFN-range hook for both
teardown and private->shared conversion can't tell them apart, and for
pKVM the two want opposite content semantics.
Two configs rather than one is right, since the needs are independent.
pKVM wants teardown but not conversion.
>
> CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates
>
> + kvm_gmem_should_set_pfn_attributes(attributes) and
> .gmem_should_set_pfn_attributes
> + kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
> .gmem_set_pfn_attributes
>
> CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates
>
> + kvm_gmem_teardown() and .gmem_teardown
>
> SNP:
>
> + .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
> and sev_gmem_should_set_pfn_attributes returns !is_private
> + Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
> + .gmem_teardown = sev_gmem_set_pfn_attributes
>
> TDX:
>
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
>
> pKVM:
>
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
> + .gmem_teardown = pkvm_gmem_set_pfn_attributes
Right for pKVM:
- teardown is not a no-op: it scrubs the page and resets the host
state to PKVM_PAGE_OWNED before the page returns to the host. Your
"reset to PKVM_PAGE_OWNED" reading is correct.
- the arch conversion hook is a no-op, so disabling SET_PFN_ATTRIBUTES
is correct. Conversions in pKVM are guest-initiated: the
share/unshare hypercall does the stage-2 and page-state transition
at EL2. The host still runs the generic conversion path (safety
check, attribute update) and accepts the conversion, but EL2 has
already done the transition, so there is nothing arch-specific left
for a hook to do. The page is preserved in place (no scrub).
If pKVM does turn out to need a step on conversion, it stays
non-destructive either way, and it can opt in later without touching
a contract others depend on.
Folding the direction check behind .gmem_should_set_pfn_attributes is
a good cleanup, it keeps the !to_private check out of generic gmem.
On naming: gmem_teardown is better. gmem_set_pfn_attributes reads a
bit close to KVM_SET_MEMORY_ATTRIBUTES, but naming is hard. :)
>
> Suzuki, does this work for ARM CCA?
>
> This way,
>
> + The if (is_private) check doesn't leak SNP details into guest_memfd
> + .gmem_make_shared doesn't stick out without a .gmem_make_private
> + .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
> conceptually as lifecycle hooks
>
> + I think the private/shared check for prepare can also be folded into
> preparation.
> + Preparation perhaps doesn't need a should_prepare equivalent since
> there's no iteration and getting the gfn is just doing some math?
> + In another patch series?
Agreed, separate series.
Thank you Ackerley!
/fuad
>
> > E.g. this? There will still be a looming rename conflict, but that's easy
> > enough
> > to handle.
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index 9ce5be7843f2..8aead0abd788 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct
> > inode *inode, pgoff_t start,
> > return safe;
> > }
> >
> > -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> > -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start,
> > pgoff_t end)
> > +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> > +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start,
> > pgoff_t end)
> > {
> > struct folio_batch fbatch;
> > pgoff_t next = start;
> > @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode,
> > pgoff_t start, pgoff_t end)
> > }
> > }
> > #else
> > -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start,
> > pgoff_t end) {}
> > +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start,
> > pgoff_t end) { }
> > #endif
> >
> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode
> > *inode, pgoff_t start,
> > kvm_gmem_invalidate_start(inode, start, end);
> >
> > if (!to_private)
> > - kvm_gmem_invalidate(inode, start, end);
> > + kvm_gmem_make_shared(inode, start, end);
> >
> > mas_store_prealloc(&mas, xa_mk_value(attrs));