sashiko.dev -- https://sashiko.dev/#/patchset/[email protected] -- wrote:
> > @@ -1084,9 +1092,29 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> >             pte_t pte, unsigned long addr, int nr)
> >  {
> >     struct mm_struct *src_mm = src_vma->vm_mm;
> > +   bool writable;
> > +
> > +   /*
> > +    * Snapshot writability before the RWP-disarm rewrite below: when the
> > +    * child is not RWP-armed, pte_modify(pte, dst_vma->vm_page_prot) can
> > +    * silently drop _PAGE_RW from a resolved (no-marker) writable PTE,
> > +    * so a later pte_write(pte) check would skip the COW wrprotect and
> > +    * leave the parent writable over a folio shared with the child.
> > +    */
> > +   writable = pte_write(pte);
> > +
> > +   /*
> > +    * Child is not RWP-armed: restore accessible protection so the
> > +    * inherited PAGE_NONE does not cost a fault on first read.
> > +    */
> > +   if (!userfaultfd_protected(dst_vma)) {
> > +           if (userfaultfd_rwp(src_vma))
> > +                   pte = pte_modify(pte, dst_vma->vm_page_prot);
> > +           pte = pte_clear_uffd(pte);
> > +   }
> Does this unconditional pte_modify() create invalid clean and writable PTEs
> for shared mappings?
>
> Without checking pte_uffd(pte) first, this blindly modifies every present PTE
> if the source VMA had RWP enabled. For shared writable mappings, vm_page_prot
> includes _PAGE_RW. If a PTE was clean and mapped read-only to intercept the
> first write for filesystem dirty-tracking, pte_modify() forces the write bit
> on while preserving the clean state.

Two reasons this is safe in practice:

1. RWP cannot reach a backing that uses the clean-RO + page_mkwrite()
   dirty-tracking pattern. vma_can_userfault() admits RWP only on anon,
   shmem and hugetlbfs (the ones with vm_uffd_ops); regular file-backed
   shared mappings are rejected at register time.

2. For backings that *do* support shared writable + dirty tracking (i.e.
   the ones RWP isn't enabled on), the fs sets vma->vm_page_prot to RO
   precisely so the first write traps page_mkwrite(). pte_modify(pte,
   dst_vma->vm_page_prot) would not add _PAGE_RW in that configuration
   either.

> Hardware writes could then bypass page_mkwrite(), leaving the filesystem
> unaware that the page is dirty, and potentially causing silent data loss
> upon reclaim.
> Could this also overwrite other important PTE states like NUMA hinting
> or soft-dirty tracking?

pte_modify() on the supported architectures preserves _PAGE_CHG_MASK,
which covers _PAGE_DIRTY, _PAGE_ACCESSED, _PAGE_SOFT_DIRTY, and the PFN.
NUMA hinting is encoded in the protection bits, so a transient hint
state on a resolved (no-marker) PTE would be lost across this rewrite,
but that's just a re-prime on the next NUMA scan, not a correctness
issue.

> @@ -298,6 +298,16 @@ static __always_inline void change_present_ptes(struct mmu_gather *tlb,
>      else if (uffd_wp_resolve || uffd_rwp_resolve)
>          ptent = pte_clear_uffd(ptent);
>
> +    /*
> +     * The uffd bit on a VM_UFFD_RWP VMA carries PROT_NONE
> +     * semantics. If mprotect() or NUMA hinting changed the
> +     * base protection, restore PAGE_NONE so the PTE still
> +     * traps on any access. pte_modify() preserves
> +     * _PAGE_UFFD.
> +     */
> +    if (userfaultfd_rwp(vma) && pte_uffd(ptent))
> +        ptent = pte_modify(ptent, PAGE_NONE);
>
> For VMAs registered with both UFFD WP and RWP, could this silently upgrade
> Write-Protected pages to Read-Write-Protected?

VM_UFFD_WP and VM_UFFD_RWP are mutually exclusive per VMA. The combination
is rejected at registration time:

  fs/userfaultfd.c, userfaultfd_register():
        if ((vm_flags & VM_UFFD_WP) && (vm_flags & VM_UFFD_RWP))
                goto out;       /* -EINVAL */

and re-registration cannot drop either bit (rejected with -EBUSY in the
same function), so a PTE in a WP-armed VMA can never satisfy
userfaultfd_rwp(vma); the new branch never fires there.

[ ... ]
> +        /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */
> +        if (userfaultfd_rwp(dst_vma)) {
> +            orig_dst_pte = pte_modify(orig_dst_pte, PAGE_NONE);
> +            orig_dst_pte = pte_mkuffd(orig_dst_pte);
> +        }
> +
>          set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
>
> Could applying PAGE_NONE unconditionally to newly moved pages trap
> applications in an infinite fault loop?

No -- the post-MOVE access is delivered as a normal RWP fault. In sync
mode it goes to the registered handler, which resolves it with
UFFDIO_RWPROTECT clearing MODE_RWP; in async mode the kernel resolves
it in-kernel and the faulting thread continues. There is no loop.

The semantics here are intentional: a VM_UFFD_RWP VMA has the contract
that every present PTE is either an active marker or a tracked-and-
resolved PTE whose next access will re-trap. UFFDIO_MOVE into such a
VMA must keep that contract, otherwise the moved-in page would be a
silent hole in the working-set view. UFFDIO_MOVE has no mode flag for
"skip protection", by design -- the same way it has no flag to skip
WP arming if dst_vma were WP-armed (and the equivalent could be added
there if we ever decide UFFDIO_MOVE should preserve markers in WP
VMAs too).

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
