Hi Sean, Frank, Lorenzo,

On Tue, Apr 21, 2026 at 10:08:48AM -0700, Frank van der Linden wrote:
> On Tue, Apr 21, 2026 at 9:31 AM Sean Christopherson <[email protected]> wrote:
> > Making guest_memfd responsible for zapping and restoring the direct map
> > on a per-folio basis feels wrong given the addition of AS_NO_DIRECT_MAP.
> > I especially don't like that the "rules" for when an AS_NO_DIRECT_MAP
> > folio has a direct map will vary based on the owner, and even within an
> > owner (e.g. guest_memfd) will be ad hoc.
> >
> > E.g. as per the series to add guest_memfd write() support[*]:
> >
> >   When direct map removal is implemented [2]
> >    - write() will not be allowed to access pages that have already
> >      been removed from direct map
> >    - on completion, write() will remove the populated pages from
> >      direct map
> >
> > That's pretty gross ABI, because with KVM_GMEM_FOLIO_NO_DIRECT_MAP,
> > userspace can write() exactly once.  To re-write memory, I assume
> > userspace would need to do a PUNCH_HOLE or truncate.
> >
> > What's preventing us from handling this automagically in e.g.
> > filemap_add_folio() and filemap_remove_folio()?  Then the usage rules
> > are pretty straightforward: the kernel must *always* assume the direct
> > map is invalid for folios from AS_NO_DIRECT_MAP mappings.
> >
> > Then if KVM needs to utilize a kernel mapping, e.g. in
> > kvm_gmem_populate(), KVM could use dedicated variants of
> > kmap_local_xxx() to deal with a local mapping for a folio/page without
> > a direct map.  Or, KVM could simply disallow the specific sequence
> > that would require KVM to do the memcpy (I'm pretty sure we can do
> > that with in-place shared=>private conversion support).
> >
> > I realize that could throw a big wrench into write() performance, but
> > IMO, before merging either series, we need a complete story for
> > exactly how this will all fit together, in a maintainable fashion and
> > with sane ABI.
>
> I agree with this - this approach would also allow for memory that was
> never in the direct map to begin with, or has been taken out already
> (for which I happen to have a use case :-)). guest_memfd and other
> code can then assume that AS_NO_DIRECT_MAP means they have to take
> explicit action to map it if needed. It's a clean, simple ABI.
>
> With the current set of patches, it seems like this couldn't be done
> in a clean manner.

Agreed with both of you.  I'll adopt the filemap-level approach:

- Move the zap/restore hooks from guest_memfd into filemap_add_folio()
  / filemap_remove_folio().
- Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
  mapping, the direct map is invalid for the entire time the folio
  resides in the page cache.
- Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
  folio->private, since the existence of the folio in the mapping is
  itself the state.
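
Concretely, I'm thinking of something along the lines of the sketch
below.  This is pseudocode, not against any particular tree:
mapping_no_direct_map() is a hypothetical helper that tests
AS_NO_DIRECT_MAP, and the per-page loop for large folios, TLB flushing,
and error handling are all omitted:

    /*
     * Pseudocode sketch of the filemap-level invariant: a folio in an
     * AS_NO_DIRECT_MAP mapping has no direct map for as long as it
     * sits in the page cache.
     */
    int filemap_add_folio(struct address_space *mapping,
                          struct folio *folio, pgoff_t index, gfp_t gfp)
    {
            int ret = __filemap_add_folio(mapping, folio, index, gfp, NULL);

            /* Zap on insertion: no kernel direct-map access from here on. */
            if (!ret && mapping_no_direct_map(mapping))
                    set_direct_map_invalid_noflush(folio_page(folio, 0));
            return ret;
    }

    void filemap_remove_folio(struct folio *folio)
    {
            /* Restore on removal, before the folio can be freed back
             * to the allocator. */
            if (mapping_no_direct_map(folio->mapping))
                    set_direct_map_default_noflush(folio_page(folio, 0));
            __filemap_remove_folio(folio, NULL);
    }

With that, PUNCH_HOLE/truncate restores the direct map as a side effect
of dropping the folio, and no per-folio state in folio->private is
needed.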

Going through each guest memory population path:

- memcpy-based population from userspace goes through the userspace
  mapping of guest_memfd, not through the kernel direct map, so the
  filemap-level invariant doesn't affect it.  But this is slow, which
  is what motivated the write() syscall support.

- write(): meant to speed up the userspace-memcpy case above by doing
  the copy in the kernel.  I believe Brendan's __GFP_UNMAPPED/mermap
  work [1] would give us a low-overhead way to get temporary kernel
  access to an AS_NO_DIRECT_MAP folio.  Landing mermap may take a
  while, but this series does not introduce the write() path, so
  mermap is not a blocker for now.

- kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
  is not available on those VM types:
  kvm_arch_gmem_supports_no_direct_map() returns false for
  KVM_X86_TDX_VM and KVM_X86_SNP_VM, the only VM types that reach
  kvm_gmem_populate() today.  So IIUC it doesn't interact with the
  filemap invariant.
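
  On the KVM side, that constraint could also be asserted defensively,
  e.g. (sketch only, with the same hypothetical mapping_no_direct_map()
  helper as above):

    /*
     * kvm_gmem_populate() is only reachable for TDX/SNP guests, and
     * kvm_arch_gmem_supports_no_direct_map() returns false for those,
     * so the backing mapping can never have AS_NO_DIRECT_MAP set.
     */
    WARN_ON_ONCE(mapping_no_direct_map(file_inode(file)->i_mapping));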

So, unless I'm missing any path, adopting the filemap-level approach in
this series should be fine.


I'd also like to consult you folks on how to proceed with merging.  In
a separate reply on the cover letter thread [2], Lorenzo and Sean
suggested that the mm pieces should go through the mm subsystem:

On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
> Yeah, when the time comes, the mm pieces definitely need to go through
> the mm tree.  Ideally, I think this would be merged in two separate
> parts, with all mm changes going through the mm tree, and then the KVM
> changes through the KVM tree using a stable topic branch/tag from
> Andrew.

I see two reasonable paths to get there, and would appreciate your
input on which you prefer:

Path A — validate on KVM side first, then split:
  - Post v13 as a single series on the KVM list, gather feedback and
    make sure the design is acceptable to KVM reviewers.
  - Once v13 looks good ("the time comes"), do the MM/KVM split,
    rebase the MM part onto the appropriate MM branch, and post the
    MM part to linux-mm to build consensus with MM maintainers.

Path B — split early and seek MM consensus in parallel:
  - With the filemap rework already in place, do the MM/KVM split
    now and post the MM part to linux-mm directly.  The KVM part follows
    on top of a stable topic from MM.

Which of the two would you rather see?  Happy to go either way.


[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Takahiro
