On 02/04/2026 23:05, Sean Christopherson wrote:
On Thu, Apr 02, 2026, Mike Rapoport wrote:
From: Nikita Kalyazin <[email protected]>

userfaultfd notifications about page faults are used for live migration
and snapshotting of VMs.

MISSING mode allows post-copy live migration, and MINOR mode allows an
optimization of post-copy live migration for VMs backed with shared
hugetlbfs or tmpfs mappings, as described in detail in commit
7677f7fd8be7 ("userfaultfd: add minor fault registration mode").

To use the same mechanisms for VMs that use guest_memfd to map their
memory, guest_memfd should support userfaultfd operations.

Add an implementation of vm_uffd_ops to guest_memfd.

Signed-off-by: Nikita Kalyazin <[email protected]>
Co-developed-by: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
---
 mm/filemap.c           |  1 +
 virt/kvm/guest_memfd.c | 84 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 83 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 406cef06b684..a91582293118 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -262,6 +262,7 @@ void filemap_remove_folio(struct folio *folio)
 	filemap_free_folio(mapping, folio);
 }
+EXPORT_SYMBOL_FOR_MODULES(filemap_remove_folio, "kvm");
This can be EXPORT_SYMBOL_FOR_KVM so that the symbol is exported if
and only if KVM is built as a module.
 /*
  * page_cache_delete_batch - delete several folios from page cache
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 017d84a7adf3..46582feeed75 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
+#include <linux/userfaultfd_k.h>

 #include "kvm_mm.h"

@@ -107,6 +108,12 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return __kvm_gmem_prepare_folio(kvm, slot, index, folio);
 }

+static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+	return __filemap_get_folio(inode->i_mapping, pgoff,
+				   FGP_LOCK | FGP_ACCESSED, 0);
Note, this will conflict with commit 6dad5447c7bf ("KVM:
guest_memfd: Don't set FGP_ACCESSED when getting folios") sitting in
https://github.com/kvm-x86/linux.git gmem
I think the resolution is to just end up with:
static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
{
	return filemap_lock_folio(inode->i_mapping, pgoff);
}
However, I think that'll be a moot point in the end (the conflict
will be avoided). More below.
+}
+
 /*
  * Returns a locked folio on success.  The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -126,8 +133,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	 * Fast-path: See if folio is already present in mapping to avoid
 	 * policy_lookup.
 	 */
-	folio = __filemap_get_folio(inode->i_mapping, index,
-				    FGP_LOCK | FGP_ACCESSED, 0);
+	folio = kvm_gmem_get_folio_noalloc(inode, index);
 	if (!IS_ERR(folio))
 		return folio;
@@ -457,12 +463,86 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_NUMA */

+#ifdef CONFIG_USERFAULTFD
+static bool kvm_gmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	/*
+	 * Only support userfaultfd for guest_memfd with INIT_SHARED flag.
+	 * This ensures the memory can be mapped to userspace.
+	 */
+	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
+		return false;
I'm not comfortable with this change. It works for now, but it's
going to be wildly wrong when in-place conversion comes along.
While I agree with the "Let's solve each problem in its
time :)"[*], the time for in-place conversion is now. In-place
conversion isn't landing this cycle or next, but it's been in
development for longer than UFFD support, and I'm not willing to
punt solvable problems to that series, because it's plenty fat as
is.
Happily, IIUC, this is an easy problem to solve, and will have a
nice side effect for the common UFFD code.
My objection to an early, global "can_userfault()" check is that
it's guaranteed to cause TOCTOU issues. E.g. for VM_UFFD_MISSING
and VM_UFFD_MINOR, the check on whether or not a given address can
be faulted in needs to happen in __do_userfault(), not broadly when
VM_UFFD_MINOR is added to a VMA. Conceptually, that also better
aligns the code with the "normal" user fault path in
kvm_gmem_fault_user_mapping().
I'm definitely not asking to fully prep for in-place conversion, I
just want to set us up for success and also to not have to churn a
pile of code. Concretely, again IIUC, I think we just need to move
the INIT_SHARED check to ->alloc_folio() and ->get_folio_noalloc().
And if we extract kvm_gmem_is_shared_mem() now instead of waiting
for in-place conversion, then we'll avoid a small amount of churn
when in-place conversion comes along.
The bonus side effect is that dropping guest_memfd's more "complex"
can_userfault means the only remaining check is constant, based on
the backing memory vs. the UFFD flags. If we want, the indirect
call to a function can be replaced with a constant vm_flags_t
variable that enumerates the supported (or unsupported, if we're
feeling negative) flags, e.g.
Thanks Sean. Checking for GUEST_MEMFD_FLAG_INIT_SHARED at the time of
use and adding uffd_ prefixes to the callbacks make sense to me. I
tested your changes in my local setup and they are functional with minor
tweaks:
- remove the no-longer-used anon_can_userfault
- fix for the vm_flags check
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 100aeadd7180..df91c40c6281 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -32,14 +32,6 @@ struct mfill_state {
 	pmd_t *pmd;
 };

-static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
-{
-	/* anonymous memory does not support MINOR mode */
-	if (vm_flags & VM_UFFD_MINOR)
-		return false;
-	return true;
-}
-
 static struct folio *anon_alloc_folio(struct vm_area_struct *vma,
 				      unsigned long addr)
 {
@@ -2051,7 +2043,7 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
 			 !ops->get_folio_noalloc)
 		return false;

-	return ops->supported_uffd_flags & vm_flags;
+	return (ops->supported_uffd_flags & vm_flags) == vm_flags;
 }

 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 6f33307c2780..8a2d0625ffa3 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -82,8 +82,8 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

 /* VMA userfaultfd operations */
 struct vm_uffd_ops {
-	/* Checks if a VMA can support userfaultfd */
-	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+	/* What UFFD flags/modes are supported. */
+	const vm_flags_t supported_uffd_flags;
 	/*
 	 * Called to resolve UFFDIO_CONTINUE request.
 	 * Should return the folio found at pgoff in the VMA's pagecache if it
with usage like:
static const struct vm_uffd_ops shmem_uffd_ops = {
	.supported_uffd_flags	= __VM_UFFD_FLAGS,
	.get_folio_noalloc	= shmem_get_folio_noalloc,
	.alloc_folio		= shmem_mfill_folio_alloc,
	.filemap_add		= shmem_mfill_filemap_add,
	.filemap_remove		= shmem_mfill_filemap_remove,
};
[*] https://lore.kernel.org/all/[email protected]
+	return true;
+}

...

+static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
+	.can_userfault		= kvm_gmem_can_userfault,
+	.get_folio_noalloc	= kvm_gmem_get_folio_noalloc,
+	.alloc_folio		= kvm_gmem_folio_alloc,
+	.filemap_add		= kvm_gmem_filemap_add,
+	.filemap_remove		= kvm_gmem_filemap_remove,
Please use kvm_gmem_uffd_xxx(). The names are a bit verbose, but
these are waaay too generic as-is, e.g. kvm_gmem_folio_alloc() has
implications and restrictions far beyond just allocating a folio.
All in all, something like so (completely untested):
---
 include/linux/userfaultfd_k.h |  4 +-
 mm/filemap.c                  |  1 +
 mm/hugetlb.c                  |  8 +---
 mm/shmem.c                    |  7 +--
 mm/userfaultfd.c              |  6 +-
 virt/kvm/guest_memfd.c        | 80 ++++++++++++++++++++++++++++++++++-
 6 files changed, 87 insertions(+), 19 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 6f33307c2780..8a2d0625ffa3 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -82,8 +82,8 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

 /* VMA userfaultfd operations */
 struct vm_uffd_ops {
-	/* Checks if a VMA can support userfaultfd */
-	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+	/* What UFFD flags/modes are supported. */
+	const vm_flags_t supported_uffd_flags;
 	/*
 	 * Called to resolve UFFDIO_CONTINUE request.
 	 * Should return the folio found at pgoff in the VMA's pagecache if it
diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..19dfcebcd23f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -262,6 +262,7 @@ void filemap_remove_folio(struct folio *folio)
 	filemap_free_folio(mapping, folio);
 }
+EXPORT_SYMBOL_FOR_MODULES(filemap_remove_folio, "kvm");

 /*
  * page_cache_delete_batch - delete several folios from page cache
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 077968a8a69a..f55857961adb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4819,14 +4819,8 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 }

 #ifdef CONFIG_USERFAULTFD
-static bool hugetlb_can_userfault(struct vm_area_struct *vma,
-				  vm_flags_t vm_flags)
-{
-	return true;
-}
-
 static const struct vm_uffd_ops hugetlb_uffd_ops = {
-	.can_userfault		= hugetlb_can_userfault,
+	.supported_uffd_flags	= __VM_UFFD_FLAGS,
 };
 #endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 239545352cd2..76d8488b9450 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3250,13 +3250,8 @@ static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
 	return folio;
 }

-static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
-{
-	return true;
-}
-
 static const struct vm_uffd_ops shmem_uffd_ops = {
-	.can_userfault		= shmem_can_userfault,
+	.supported_uffd_flags	= __VM_UFFD_FLAGS,
 	.get_folio_noalloc	= shmem_get_folio_noalloc,
 	.alloc_folio		= shmem_mfill_folio_alloc,
 	.filemap_add		= shmem_mfill_filemap_add,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9ba6ec8c0781..ccbd7bb334c2 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -58,8 +58,8 @@ static struct folio *anon_alloc_folio(struct vm_area_struct *vma,
 }

 static const struct vm_uffd_ops anon_uffd_ops = {
-	.can_userfault		= anon_can_userfault,
-	.alloc_folio		= anon_alloc_folio,
+	.supported_uffd_flags	= __VM_UFFD_FLAGS & ~VM_UFFD_MINOR,
+	.alloc_folio		= anon_alloc_folio,
 };

 static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
@@ -2055,7 +2055,7 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
 			 !ops->get_folio_noalloc)
 		return false;

-	return ops->can_userfault(vma, vm_flags);
+	return ops->supported_uffd_flags & vm_flags;
 }

 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 462c5c5cb602..e634bf671d12 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
+#include <linux/userfaultfd_k.h>

 #include "kvm_mm.h"

@@ -59,6 +60,11 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }

+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED;
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -396,7 +402,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;

-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
+	if (!kvm_gmem_is_shared_mem(inode, vmf->pgoff))
 		return VM_FAULT_SIGBUS;

 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
@@ -456,12 +462,84 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_NUMA */

+#ifdef CONFIG_USERFAULTFD
+static struct folio *kvm_gmem_uffd_get_folio_noalloc(struct inode *inode,
+						     pgoff_t pgoff)
+{
+	if (!kvm_gmem_is_shared_mem(inode, pgoff))
+		return NULL;
+
+	return filemap_lock_folio(inode->i_mapping, pgoff);
+}
+
+static struct folio *kvm_gmem_uffd_folio_alloc(struct vm_area_struct *vma,
+					       unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	struct mempolicy *mpol;
+	struct folio *folio;
+	gfp_t gfp;
+
+	if (unlikely(pgoff >= (i_size_read(inode) >> PAGE_SHIFT)))
+		return NULL;
+
+	if (!kvm_gmem_is_shared_mem(inode, pgoff))
+		return NULL;
+
+	gfp = mapping_gfp_mask(inode->i_mapping);
+	mpol = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
+	mpol = mpol ?: get_task_policy(current);
+	folio = filemap_alloc_folio(gfp, 0, mpol);
+	mpol_cond_put(mpol);
+
+	return folio;
+}
+
+static int kvm_gmem_uffd_filemap_add(struct folio *folio,
+				     struct vm_area_struct *vma,
+				     unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	int err;
+
+	__folio_set_locked(folio);
+	err = filemap_add_folio(mapping, folio, pgoff, GFP_KERNEL);
+	if (err) {
+		folio_unlock(folio);
+		return err;
+	}
+
+	return 0;
+}
+
+static void kvm_gmem_uffd_filemap_remove(struct folio *folio,
+					 struct vm_area_struct *vma)
+{
+	filemap_remove_folio(folio);
+	folio_unlock(folio);
+}
+
+static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
+	.supported_uffd_flags	= __VM_UFFD_FLAGS,
+	.get_folio_noalloc	= kvm_gmem_uffd_get_folio_noalloc,
+	.alloc_folio		= kvm_gmem_uffd_folio_alloc,
+	.filemap_add		= kvm_gmem_uffd_filemap_add,
+	.filemap_remove		= kvm_gmem_uffd_filemap_remove,
+};
+#endif /* CONFIG_USERFAULTFD */
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault = kvm_gmem_fault_user_mapping,
 #ifdef CONFIG_NUMA
 	.get_policy = kvm_gmem_get_policy,
 	.set_policy = kvm_gmem_set_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops = &kvm_gmem_uffd_ops,
+#endif
 };

 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)

base-commit: d63beb006dba56d5fa219f106c7a97eb128c356f
--