On Thu, Apr 23, 2026 at 04:10:30PM -0400, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> > >
> > > The other thing is, as I mentioned in the other email, I still don't know
> > > how the current RW protection would work for anonymous. I don't yet think
> > > the user swapper can read the anon page with RW-protected pgtables. So far
> > > my understanding is maybe you only care about shmem so it's fine, but it'll
> > > always be great to confirm with you.

That's true. We use vhost and therefore shmem in our setup.

One idea I had for making atomic eviction work for anon is extending
process_vm_readv() and process_madvise():

 - Add a flag to process_vm_readv() to bypass the protnone check on
   accessible (or only RWP?) VMAs.

 - Allow process_madvise(MADV_DONTNEED) when the caller already has
   ptrace write access to the target. The standing objection to remote
   DONTNEED has been that it is "destructive", but process_vm_writev()
   already lets a ptrace-capable caller overwrite arbitrary anon memory
   with attacker-chosen content. DONTNEED is strictly weaker (it zeroes,
   it does not inject), so the trust model is already established.

> > I wonder if uffdio_move could be used for a swapper implementation instead?

I considered it. UFFDIO_MOVE can in principle relocate the cold folio
into a staging VMA inside the VMM, which then reads it and drops it. The
downside is that the VMM has to maintain a second address range and
serialise eviction through it. A purpose-built primitive, something like
UFFDIO_EVICT that zaps the PTE and returns the folio contents (optionally
to an fd for io_uring), seems cleaner.

> If RW is justified to be useful first, maybe.
>
> I had a gut feeling Kirill's use case doesn't use anon at all, then if
> nobody needs it we can still decide to not support anon.
>
> >
> > If we ever have to read from a protnone page, maybe we could teach ptrace
> > access to do it, or have something that can read from prot_none areas -- like
> > uffdio_copy, which can write to prot-none areas.
>
> Something like swap_access() in my proposal can also partly achieve that.
>
> https://lore.kernel.org/all/[email protected]/

A maccess()-style primitive that reads through PROT_NONE is a reasonable
building block and overlaps with part of what UFFDIO_EVICT would need.

> There, it was only about reading from swap so far, though.
> But that one might be easier to be extended to read PROT_NONE and directly
> put data into buffer user specified (ps: in my local tree impl I named it
> maccess() to pair with mincore(), but it doesn't really matter; it doesn't
> even need to be a syscall..).
>
> To me, the interfacing is not a major issue. The major question I have is
> why RW protection can help in swap system impl when we already have uffd-wp.
>
> So I want to make sure the use case can't be implemented by uffd-wp already.
> Because that's really what we might do for QEMU.

Race-free eviction can definitely be implemented with uffd-wp already.
Proper working set discovery cannot: uffd-wp only traps writes, so read
accesses go unnoticed.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
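
For reference, a minimal userspace sketch of the DONTNEED semantics the
trust-model argument above relies on: after MADV_DONTNEED on private
anonymous memory, the next read sees zero-filled pages, not stale data.
It uses plain madvise() on the caller's own mapping, not
process_madvise() on a remote one, and the helper name
dontneed_zeroes_anon() is made up for illustration; it is not part of
any proposed patch.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): fill a private anonymous
 * mapping with a non-zero pattern, drop it with MADV_DONTNEED, and
 * check what the next read observes.
 *
 * Returns 1 if every byte reads back as zero, 0 if any byte survived,
 * and -1 if mmap() or madvise() failed.
 */
static int dontneed_zeroes_anon(size_t len)
{
	unsigned char *p;
	size_t i;
	int all_zero = 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Populate the pages so there is real content to discard. */
	memset(p, 0xaa, len);

	/* Zap the pages; the mapping itself stays in place. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Re-reading faults in fresh zero-filled pages. */
	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			all_zero = 0;
			break;
		}
	}

	munmap(p, len);
	return all_zero;
}
```

Calling dontneed_zeroes_anon(4096) should return 1 on Linux. This only
holds for private anonymous memory; for shmem or file-backed mappings
the pages are repopulated from the backing object instead.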

