Hi Sean, Frank, Lorenzo,

On Tue, Apr 21, 2026 at 10:08:48AM -0700, Frank van der Linden wrote:
> On Tue, Apr 21, 2026 at 9:31 AM Sean Christopherson <[email protected]> wrote:
> > Making guest_memfd responsible for zapping and restoring the direct map on
> > a per-folio basis feels wrong given the addition of AS_NO_DIRECT_MAP. I
> > especially don't like that the "rules" for when an AS_NO_DIRECT_MAP folio
> > has a direct map will vary based on the owner, and even within an owner
> > (e.g. guest_memfd) will be ad hoc.
> >
> > E.g. as per the series to add guest_memfd write() support[*]:
> >
> >   When direct map removal is implemented [2]
> >   - write() will not be allowed to access pages that have already
> >     been removed from direct map
> >   - on completion, write() will remove the populated pages from
> >     direct map
> >
> > That's pretty gross ABI, because with KVM_GMEM_FOLIO_NO_DIRECT_MAP, userspace
> > can write() exactly once. To re-write memory, I assume userspace would need
> > to do a PUNCH_HOLE or truncate.
> >
> > What's preventing us from handling this automagically in e.g.
> > filemap_add_folio() and filemap_remove_folio()? Then the usage rules are
> > pretty straightforward: the kernel must *always* assume the direct map is
> > invalid for folios from AS_NO_DIRECT_MAP mappings.
> >
> > Then if KVM needs to utilize a kernel mapping, e.g. in kvm_gmem_populate(),
> > KVM could use dedicated variants of kmap_local_xxx() to deal with a local
> > mapping for a folio/page without a direct map. Or, KVM could simply disallow
> > the specific sequence that would require KVM to do the memcpy (I'm pretty sure
> > we can do that with in-place shared=>private conversion support).
> >
> > I realize that could throw a big wrench into write() performance, but IMO,
> > before merging either series, we need a complete story for exactly how this
> > will all fit together, in a maintainable fashion and with sane ABI.
>
> I agree with this - this approach would also allow for memory that was
> never in the direct map to begin with, or has been taken out already
> (for which I happen to have a use case :-)). guest_memfd and other
> code can then assume that AS_NO_DIRECT_MAP means they have to take
> explicit action to map it if needed. It's a clean, simple ABI.
>
> With the current set of patches, it seems like this couldn't be done
> in a clean manner.

Agreed with both of you. I'll adopt the filemap-level approach:

- Move the zap/restore hooks from guest_memfd into filemap_add_folio()
  / filemap_remove_folio().
- Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
  mapping, the direct map is invalid for the entire time the folio
  resides in the page cache.
- Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
  folio->private, since the existence of the folio in the mapping is
  itself the state.
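
To make the intended invariant concrete, here is a minimal userspace
model of the three points above. It is not kernel code: every model_*
name is hypothetical, the bool fields only stand in for AS_NO_DIRECT_MAP
and for the real direct map state, and the actual hooks would of course
have to deal with TLB flushing and error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are kernel definitions. */
struct model_mapping {
	bool no_direct_map;		/* stands in for AS_NO_DIRECT_MAP */
};

struct model_folio {
	struct model_mapping *mapping;	/* NULL while not in the page cache */
	bool direct_map_valid;		/* stands in for the direct map state */
};

/* Models the filemap_add_folio() hook: zap on insertion. */
static void model_filemap_add_folio(struct model_mapping *m,
				    struct model_folio *f)
{
	assert(f->mapping == NULL);	/* folio must not already be cached */
	f->mapping = m;
	if (m->no_direct_map)
		f->direct_map_valid = false;
}

/* Models the filemap_remove_folio() hook: restore on removal. */
static void model_filemap_remove_folio(struct model_folio *f)
{
	if (f->mapping && f->mapping->no_direct_map)
		f->direct_map_valid = true;
	f->mapping = NULL;
}

/*
 * The rule for callers: membership in a no-direct-map mapping is the
 * only state consulted, so no per-folio flag is needed.
 */
static bool model_may_use_direct_map(const struct model_folio *f)
{
	return !(f->mapping && f->mapping->no_direct_map);
}
```

In this model the answer to "can I touch this folio through the direct
map?" depends only on the mapping the folio belongs to, which is the
property that lets the folio->private bookkeeping go away.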

For each guest memory population path:

- memcpy-based population from userspace goes through the userspace
  mapping of guest_memfd, not through the kernel direct map, so the
  filemap-level invariant doesn't affect it. But this is slow, which
  is what motivated the write() syscall support.
- write(): meant to speed up the userspace-memcpy case above by doing
  the copy in the kernel. I believe Brendan's __GFP_UNMAPPED/mermap
  work [1] would give us a low-overhead way to get temporary kernel
  access to an AS_NO_DIRECT_MAP folio. Landing mermap may take a
  while, but this series does not introduce the write() path, so
  mermap is not a blocker for now.
- kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
  is not available on those VM types:
  kvm_arch_gmem_supports_no_direct_map() returns false for
  KVM_X86_TDX_VM and KVM_X86_SNP_VM, and those are the only VM types
  that call kvm_gmem_populate() today. So it doesn't interact with
  the filemap invariant IIUC.

So, unless I'm missing a path, adopting the filemap-level approach in
this series should be fine.

I'd also like to ask in advance how to proceed with merging. In a
separate reply on the cover letter thread [2], Lorenzo and Sean
suggested that the mm pieces should go through the mm subsystem:

On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
> Yeah, when the time comes, the mm pieces definitely need to go through the mm
> tree. Ideally, I think this would be merged in two separate parts, with all mm
> changes going through the mm tree, and then the KVM changes through the KVM
> tree using a stable topic branch/tag from Andrew.

I see two reasonable paths to get there, and would appreciate your
input on which you prefer:

Path A (validate on the KVM side first, then split):
- Post v13 as a single series on the KVM list, gather feedback and
  make sure the design is acceptable to KVM reviewers.
- Once v13 looks good ("the time comes"), do the MM/KVM split,
  rebase the MM part onto the appropriate MM branch, and post the
  MM part to linux-mm to build consensus with MM maintainers.

Path B (split early and seek MM consensus in parallel):
- With the filemap rework already in place, do the MM/KVM split
  now and post the MM part to linux-mm directly. The KVM part follows
  on top of a stable topic branch from MM.

Which of the two would you rather see? Happy to go either way.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Takahiro