From: "Kiryl Shutsemau (Meta)" <[email protected]> Add an admin-guide section covering UFFDIO_REGISTER_MODE_RWP:
- sync and async fault models; - UFFDIO_RWPROTECT semantics; - UFFD_FEATURE_RWP_ASYNC; - UFFDIO_SET_MODE runtime mode flips. It also covers typical VMM working-set-tracking workflow from detection loop through sync-mode eviction and back to async. Signed-off-by: Kiryl Shutsemau <[email protected]> Assisted-by: Claude:claude-opus-4-6 --- Documentation/admin-guide/mm/userfaultfd.rst | 259 ++++++++++++++++++- 1 file changed, 253 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 1e533639fd50..69836e39aa0b 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -275,16 +275,16 @@ tracking and it can be different in a few ways: - Dirty information will not get lost if the pte was zapped due to various reasons (e.g. during split of a shmem transparent huge page). - - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit - set; dirty when uffd-wp bit cleared), it has different semantics on - some of the memory operations. For example: ``MADV_DONTNEED`` on + - Due to a reverted meaning of soft-dirty (page clean when the uffd bit + is set; dirty when the uffd bit is cleared), it has different semantics + on some of the memory operations. For example: ``MADV_DONTNEED`` on anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as - dirtying of memory by dropping uffd-wp bit during the procedure. + dirtying of memory by dropping the uffd bit during the procedure. The user app can collect the "written/dirty" status by looking up the -uffd-wp bit for the pages being interested in /proc/pagemap. +uffd bit for the pages being interested in /proc/pagemap. -The page will not be under track of uffd-wp async mode until the page is +The page will not be under track of userfaultfd-wp async mode until the page is explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault that was tracked by async mode userfaultfd-wp is invalid. @@ -307,6 +307,253 @@ transparent to the guest, we want that same address range to act as if it was still poisoned, even though it's on a new physical host which ostensibly doesn't have a memory error in the exact same spot. +Read-Write Protection +--------------------- + +``UFFDIO_REGISTER_MODE_RWP`` enables read-write protection tracking on a +memory range. It is similar to (but faster than) ``mprotect(PROT_NONE)`` +combined with a signal handler; unlike ``mprotect(PROT_NONE)``, RWP only +traps accesses to *present* PTEs, so accesses to unpopulated addresses in a +protected range fall through to the normal missing-page path. It uses the +PROT_NONE hinting mechanism (same as NUMA balancing) to make pages +inaccessible while keeping them resident in memory. Works on anonymous, +shmem, and hugetlbfs memory. + +RWP is designed for VM memory managers that need to track the working set +of guest memory for cold page eviction to tiered or remote storage. + +**Setup:** + +1. Open a userfaultfd and enable ``UFFD_FEATURE_RWP`` via ``UFFDIO_API``. + Optionally request ``UFFD_FEATURE_RWP_ASYNC`` as well — it requires + ``UFFD_FEATURE_RWP`` to be set in the same ``UFFDIO_API`` call. + +2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_RWP`` + (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be + fetched back from storage). + +**Feature availability:** + +RWP is built on top of two kernel primitives: a spare PTE bit owned by +userfaultfd (``CONFIG_HAVE_ARCH_USERFAULTFD_WP``) and architecture support +for present-but-inaccessible PTEs (``CONFIG_ARCH_HAS_PTE_PROTNONE``). When both +are available on a 64-bit kernel, the build selects +``CONFIG_USERFAULTFD_RWP=y`` and the ``VM_UFFD_RWP`` VMA flag becomes +available. + +``UFFD_FEATURE_RWP`` and ``UFFD_FEATURE_RWP_ASYNC`` are unavailable when +the running kernel or architecture does not support them — for example +32-bit kernels (where ``VM_UFFD_RWP`` is unavailable), kernels built +without ``CONFIG_USERFAULTFD_RWP``, and architectures whose ptes cannot +carry the uffd bit at runtime (e.g. riscv without the ``SVRSW60T59B`` +extension). Requesting an unsupported feature in +``uffdio_api.features`` makes ``UFFDIO_API`` fail with ``EINVAL`` and +leaves the userfaultfd context uninitialized; the bitmask returned in +``uffdio_api.features`` then advertises the features the kernel does +support. The recommended probe sequence is therefore to open a +throwaway userfaultfd, call ``UFFDIO_API`` once with ``features = 0``, +inspect the returned bitmask, close that fd, then open the real one +and call ``UFFDIO_API`` again with only the supported features set. + +**Protecting and Unprotecting:** + +Use ``UFFDIO_RWPROTECT`` to protect or unprotect a range, mirroring the +``UFFDIO_WRITEPROTECT`` interface:: + + struct uffdio_rwprotect rwp = { + .range = { .start = addr, .len = len }, + .mode = UFFDIO_RWPROTECT_MODE_RWP, /* protect */ + }; + ioctl(uffd, UFFDIO_RWPROTECT, &rwp); + +Setting ``UFFDIO_RWPROTECT_MODE_RWP`` sets PROT_NONE on present PTEs in the +range. Pages stay resident and their physical frames are preserved — only +access permissions are removed. + +Clearing ``UFFDIO_RWPROTECT_MODE_RWP`` restores normal VMA permissions and +wakes any faulting threads (unless ``UFFDIO_RWPROTECT_MODE_DONTWAKE`` is set). + +**Scope of protection:** + +RWP protection is a property of *present* PTEs. ``UFFDIO_RWPROTECT`` only +affects entries that are already populated. Unpopulated addresses within +the range remain unpopulated; when first accessed they fault through the +normal missing path (``do_anonymous_page()``, ``do_swap_page()``, +``finish_fault()``) and the resulting PTE is not RWP-protected. To observe +the population itself, co-register the range with +``UFFDIO_REGISTER_MODE_MISSING``. + +Protection is preserved across page reclaim: a page swapped out while +RWP-protected carries the marker on its swap entry, and swap-in restores +the PROT_NONE state so the first access after swap-in still faults. The +same applies to pages temporarily replaced by migration entries. + +Operations that drop the PTE entirely — ``MADV_DONTNEED`` on anonymous +memory, hole-punch on shmem, truncation of a file mapping — also drop the +RWP marker: the next access re-populates the range without protection. +Unlike WP (which persists via ``PTE_MARKER_UFFD_WP``), there is no +persistent RWP marker today. The user needs to re-arm the range with +``UFFDIO_RWPROTECT`` after any operation that explicitly frees PTEs. + +**Fault Handling:** + +When a protected page is accessed: + +- **Sync mode** (default): The faulting thread blocks and a + ``UFFD_PAGEFAULT_FLAG_RWP`` message is delivered to the userfaultfd + handler. The handler resolves the fault with ``UFFDIO_RWPROTECT`` + (clearing ``MODE_RWP``), which restores the PTE permissions and wakes + the faulting thread. + +- **Async mode** (``UFFD_FEATURE_RWP_ASYNC``): The kernel automatically + restores PTE permissions and the thread continues without blocking. No + message is delivered to the handler. + +**Runtime Mode Switching:** + +``UFFDIO_SET_MODE`` toggles ``UFFD_FEATURE_RWP_ASYNC`` at runtime, allowing +the VMM to switch between lightweight async detection and safe sync +eviction without re-registering. The toggle takes ``mmap_write_lock()`` +and calls ``vma_start_write()`` on each UFFD-armed VMA, draining +in-flight per-VMA-locked faults before the new mode takes effect. + +**Working-set detection with PAGEMAP_SCAN:** + +RWP-protected PTEs carry the uffd PTE bit; an access (and, in async mode, its +auto-resolution) clears it. ``PAGEMAP_SCAN`` reports ``PAGE_IS_ACCESSED`` once +the bit is clear on a ``VM_UFFD_RWP`` VMA, so a *non-inverted* scan reports the +pages that were touched during the interval -- the hot set:: + + struct pm_scan_arg arg = { + .size = sizeof(arg), + .start = guest_mem_start, + .end = guest_mem_end, + .vec = (uint64_t)regions, + .vec_len = regions_len, + .category_mask = PAGE_IS_ACCESSED, + .return_mask = PAGE_IS_ACCESSED, + }; + long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg); + +The returned ``page_region`` array lists the hot ranges. ``PAGE_IS_ACCESSED`` +is set on an accessed page whether it is still present or has since been +swapped out, so the hot scan needs no ``PAGE_IS_PRESENT`` filter -- unpopulated +holes carry neither bit and are excluded on their own. + +Track the hot set and reclaim everything else from the backing file (see the +workflow below). Do **not** invert the scan to enumerate "cold" pages +directly: an inverted scan reports only the ``VM_UFFD_RWP`` PTEs that are still +protected, i.e. the resident portion of *this* VMA. For a file mapping the +working set spans the whole file -- pages that live in the page cache but are +not mapped into this VMA (a pre-populated tmpfs file, or memory populated +through another mapping) are ``pte_none`` here, never appear in the scan, and +would never be considered for eviction even though they occupy memory. Driving +eviction from "file offsets minus the hot set" avoids that blind spot; a cold +PTE scan cannot. To additionally record the *first* access to a cached but +unmapped page (e.g. pre-populated content) as hot, co-register the range with +``UFFDIO_REGISTER_MODE_MINOR``: such accesses then fault as minor faults +instead of mapping the page silently. + +**Cleanup:** + +When the userfaultfd is closed or the range is unregistered, all PROT_NONE +PTEs are automatically restored to their normal VMA permissions. This +prevents pages from becoming permanently inaccessible. + +**VMM Working Set Tracking Workflow:** + +A typical VMM lifecycle for cold page eviction to tiered storage. Two +mappings of the same shmem (or hugetlbfs) file are used: ``guest_mem`` is +the RWP-registered mapping that vCPUs access through, and ``io_mem`` is a +private mapping for VMM-side I/O. Reading ``io_mem`` does not go through +the RWP-protected PTEs of ``guest_mem``, so the VMM's own ``pwrite()`` +never traps on its own :: + + /* One-time setup */ + fd = memfd_create("guest", MFD_CLOEXEC); + ftruncate(fd, guest_size); + guest_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); /* vCPU view, RWP-registered */ + io_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); /* VMM I/O view, unprotected */ + + uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK); + struct uffdio_api api = { + .api = UFFD_API, + .features = UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC, + }; + ioctl(uffd, UFFDIO_API, &api); + if (!(api.features & UFFD_FEATURE_RWP)) + /* RWP unavailable on this kernel/arch -- fall back. */ + ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){ + .range = { guest_mem, guest_size }, + .mode = UFFDIO_REGISTER_MODE_RWP | + UFFDIO_REGISTER_MODE_MISSING, + }); + + /* Tracking loop */ + while (vm_running) { + /* 1. Detection phase (async -- no vCPU stalls) */ + ioctl(uffd, UFFDIO_RWPROTECT, &(struct uffdio_rwprotect){ + .range = full_range, + .mode = UFFDIO_RWPROTECT_MODE_RWP }); + sleep(tracking_interval); + + /* + * 2. Switch to sync BEFORE scanning. In async mode a vCPU + * access races eviction: it would auto-resolve and mark the + * page hot just as the VMM writes it out and punches it, + * losing the update. Sync mode makes such accesses block and + * be delivered, freezing the hot snapshot for the rest of the + * iteration. + */ + ioctl(uffd, UFFDIO_SET_MODE, + &(struct uffdio_set_mode){ + .disable = UFFD_FEATURE_RWP_ASYNC }); + + /* 3. Read the hot set: pages touched this interval. */ + ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){ + .category_mask = PAGE_IS_ACCESSED, + .return_mask = PAGE_IS_ACCESSED, + ... + }); + + /* + * 4. Reclaim the file offsets that are NOT in the hot set. + * Driving this from the file's offset space (rather than from a + * cold PTE scan) also reclaims pages that are cached but not + * mapped into guest_mem, e.g. pre-populated content. + */ + for each non-hot offset range: + /* Read from io_mem -- bypasses RWP, no fault. */ + pwrite(storage_fd, (char *)io_mem + off, len, off); + /* Drop the page from the shared file. */ + fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + off, len); + /* + * Wake any vCPU blocked on the RWP fault for this range: + * fallocate() does not iterate ctx->fault_pending_wqh. + */ + ioctl(uffd, UFFDIO_WAKE, &(struct uffdio_range){ + .start = (uintptr_t)guest_mem + off, .len = len }); + + /* 5. Resume async tracking */ + ioctl(uffd, UFFDIO_SET_MODE, + &(struct uffdio_set_mode){ + .enable = UFFD_FEATURE_RWP_ASYNC }); + } + +During step 4, a vCPU that accesses a ``guest_mem`` offset being evicted +blocks with a ``UFFD_PAGEFAULT_FLAG_RWP`` fault while the eviction is in +progress. After ``fallocate()`` punches the page out and ``UFFDIO_WAKE`` +fires, the vCPU retries the access, faults as ``MISSING``, and the +handler resolves it with ``UFFDIO_COPY`` from storage. + +This workflow targets shmem and hugetlbfs (both support a private +``io_mem`` mapping over the same fd). Anonymous-memory backings need a +different inner-loop strategy because the VMM has no way to read the +page without going through the RWP-protected mapping. + QEMU/KVM ======== -- 2.54.0

