On 05/03/2026 19:18, David Hildenbrand (Arm) wrote:
On 1/26/26 17:50, Kalyazin, Nikita wrote:
From: Patrick Roy <[email protected]>

Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.

To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.

Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The use case for also removing the
direct map entries of the shared parts of guest_memfd is a special type of
non-CoCo VM where host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated.  In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.

Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
a gmem folio can be allocated without being mapped anywhere is
kvm_gmem_populate(), where handling potential failures of direct map
removal is not possible (by the time direct map removal is attempted,
the folio is already marked as prepared, meaning attempting to re-try
kvm_gmem_populate() would just result in -EEXIST without fixing up the
direct map state). These folios are then removed from the direct map
upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.

Signed-off-by: Patrick Roy <[email protected]>
Signed-off-by: Nikita Kalyazin <[email protected]>
---
  Documentation/virt/kvm/api.rst  | 21 +++++----
  arch/x86/include/asm/kvm_host.h |  5 +--
  arch/x86/kvm/x86.c              |  5 +++
  include/linux/kvm_host.h        | 12 +++++
  include/uapi/linux/kvm.h        |  1 +
  virt/kvm/guest_memfd.c          | 80 ++++++++++++++++++++++++++++++---
  6 files changed, 106 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 01a3abef8abb..c5ee43904bca 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6440,15 +6440,18 @@ a single guest_memfd file, but the bound ranges must not overlap).
  The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
  specified via KVM_CREATE_GUEST_MEMFD.  Currently defined flags:

-  ============================ ================================================
-  GUEST_MEMFD_FLAG_MMAP        Enable using mmap() on the guest_memfd file
-                               descriptor.
-  GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
-                               KVM_CREATE_GUEST_MEMFD (memory files created
-                               without INIT_SHARED will be marked private).
-                               Shared memory can be faulted into host userspace
-                               page tables. Private memory cannot.
-  ============================ ================================================
+  ============================== ================================================
+  GUEST_MEMFD_FLAG_MMAP          Enable using mmap() on the guest_memfd file
+                                 descriptor.
+  GUEST_MEMFD_FLAG_INIT_SHARED   Make all memory in the file shared during
+                                 KVM_CREATE_GUEST_MEMFD (memory files created
+                                 without INIT_SHARED will be marked private).
+                                 Shared memory can be faulted into host userspace
+                                 page tables. Private memory cannot.
+  GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will unmap the memory
+                                 backing it from the kernel's address space
+                                 before passing it off to userspace or the guest.
+  ============================== ================================================

  When the KVM MMU performs a PFN lookup to service a guest fault and the backing
  guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 68bd29a52f24..6de1c3a6344f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2483,10 +2483,7 @@ static inline bool kvm_arch_has_irq_bypass(void)
  }

  #ifdef CONFIG_KVM_GUEST_MEMFD
-static inline bool kvm_arch_gmem_supports_no_direct_map(void)
-{
-     return can_set_direct_map();
-}
+bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm);

It's odd given that you introduced that code two patches previously. Can
these changes directly be squashed into the earlier patch?
[...]

You're right, I'll pull it into the "KVM: x86: define kvm_arch_gmem_supports_no_direct_map()" patch.



+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+     return ((u64)folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+     u64 gmem_flags = GMEM_I(folio_inode(folio))->flags;
+     int r = 0;
+
+     if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
+             goto out;
+
+     folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+     r = folio_zap_direct_map(folio);

And if it fails, you'd leave KVM_GMEM_FOLIO_NO_DIRECT_MAP set.

What about modifying ->private only if it really worked?

True. I'll do

        r = folio_zap_direct_map(folio);
        if (!r)
                folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
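
For illustration only, here is a minimal userspace sketch of the ordering proposed above: the flag bit is set in the private field only after the zap has succeeded, so a failed zap leaves the folio looking untouched. Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in made up for this sketch (including the zap_should_fail knob); none of it is the kernel's definition of struct folio or folio_zap_direct_map().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_GMEM_FOLIO_NO_DIRECT_MAP. */
#define MOCK_NO_DIRECT_MAP ((uintptr_t)1 << 0)

/* Hypothetical stand-in for struct folio; only what the sketch needs. */
struct mock_folio {
	void *private;        /* bit 0 records "direct map entries removed" */
	bool zap_should_fail; /* knob to simulate an allocation failure */
};

static bool mock_folio_no_direct_map(const struct mock_folio *folio)
{
	return ((uintptr_t)folio->private & MOCK_NO_DIRECT_MAP) != 0;
}

/* Stand-in for folio_zap_direct_map(): fails on demand, otherwise succeeds. */
static int mock_zap_direct_map(struct mock_folio *folio)
{
	return folio->zap_should_fail ? -ENOMEM : 0;
}

static int mock_folio_zap(struct mock_folio *folio)
{
	int r;

	/* Already zapped: nothing to do. */
	if (mock_folio_no_direct_map(folio))
		return 0;

	r = mock_zap_direct_map(folio);
	/* Record the state change only once it has really happened. */
	if (!r)
		folio->private = (void *)((uintptr_t)folio->private |
					  MOCK_NO_DIRECT_MAP);

	return r;
}
```

With this ordering a caller that sees an error can retry later, and the free path never attempts to restore direct map entries that were never removed.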


+
+out:
+     return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+     /*
+      * Direct map restoration cannot fail, as the only error condition
+      * for direct map manipulation is failure to allocate page tables
+      * when splitting huge pages, but this split would have already
+      * happened in folio_zap_direct_map() in kvm_gmem_folio_zap_direct_map().
+      * Note that the splitting occurs always because guest_memfd
+      * currently supports only base pages.
+      * Thus folio_restore_direct_map() here only updates prot bits.
+      */
+     WARN_ON_ONCE(folio_restore_direct_map(folio));

Which raised the question: why should this function then even return an
error?

Dave pointed out earlier that failures are possible [1]. Do you think we can document it better?

[1] https://lore.kernel.org/kvm/[email protected]/



+     folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+}
+
  static inline void kvm_gmem_mark_prepared(struct folio *folio)
  {
       folio_mark_uptodate(folio);
@@ -393,11 +433,17 @@ static bool kvm_gmem_supports_mmap(struct inode *inode)
       return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_MMAP;
  }

+static bool kvm_gmem_no_direct_map(struct inode *inode)
+{
+     return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+}
+
  static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
  {
       struct inode *inode = file_inode(vmf->vma->vm_file);
       struct folio *folio;
       vm_fault_t ret = VM_FAULT_LOCKED;
+     int err;

       if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
               return VM_FAULT_SIGBUS;
@@ -423,6 +469,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
               kvm_gmem_mark_prepared(folio);
       }

+     if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+             err = kvm_gmem_folio_zap_direct_map(folio);
+             if (err) {
+                     ret = vmf_error(err);
+                     goto out_folio;
+             }
+     }
+
       vmf->page = folio_file_page(folio, vmf->pgoff);

  out_folio:
@@ -533,6 +587,9 @@ static void kvm_gmem_free_folio(struct folio *folio)
       kvm_pfn_t pfn = page_to_pfn(page);
       int order = folio_order(folio);

+     if (kvm_gmem_folio_no_direct_map(folio))
+             kvm_gmem_folio_restore_direct_map(folio);
+
       kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
  }

@@ -596,6 +653,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
       /* Unmovable mappings are supposed to be marked unevictable as well. */
       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

+     if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+             mapping_set_no_direct_map(inode->i_mapping);
+
       GMEM_I(inode)->flags = flags;

       file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
@@ -804,15 +864,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
       if (IS_ERR(folio))
               return PTR_ERR(folio);

-     if (!is_prepared)
+     if (!is_prepared) {
               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+             if (r)
+                     goto out_unlock;
+     }
+
+     if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+             r = kvm_gmem_folio_zap_direct_map(folio);
+             if (r)
+                     goto out_unlock;
+     }


It's a bit nasty that we have two different places where we have to call
this. Smells error prone.

We will actually have 2 more: for the write() syscall and UFFDIO_COPY, and 0 once we have [2]

[2] https://lore.kernel.org/linux-mm/[email protected]/


I was wondering why kvm_gmem_get_folio() cannot handle that?

Most of the call sites follow the pattern alloc -> write -> zap so they'll need direct map for some time after the allocation.
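
To make the ordering constraint concrete, here is a toy userspace model of the alloc -> write -> zap pattern. The names (struct mock_page, mock_alloc(), mock_write(), mock_zap()) are hypothetical and exist only in this sketch. Returning -EFAULT for a write through a removed mapping is also an assumption of the model; it is not what the kernel would do on such an access.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a page plus its direct map state. */
struct mock_page {
	unsigned char data[64];
	bool in_direct_map;
};

/* A freshly allocated page is still reachable through the direct map. */
static void mock_alloc(struct mock_page *p)
{
	memset(p->data, 0, sizeof(p->data));
	p->in_direct_map = true;
}

/* Models a kernel-side write that goes through the direct map. */
static int mock_write(struct mock_page *p, const void *src, size_t len)
{
	if (!p->in_direct_map)
		return -EFAULT; /* modelling assumption, see lead-in */
	if (len > sizeof(p->data))
		return -EINVAL;
	memcpy(p->data, src, len);
	return 0;
}

/* After the zap, kernel-side writes in this model no longer work. */
static void mock_zap(struct mock_page *p)
{
	p->in_direct_map = false;
}
```

In this model, zapping at allocation time would make the subsequent write fail, which is why the zap has to come last for call sites that initialise the page from the kernel.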


Then fallocate() would also be handled directly, instead of
later at fault time etc.

Good question about fallocate(). It's not apparent to me that it needs to remove pages from the direct map, because we may not be able to initialise them later on if we do.


Is it because __kvm_gmem_populate() etc need to write to this page?

I think it also applies to write(), UFFDIO_COPY and kvm_gmem_fault_user_mapping().



--
Cheers,

David

