On Wed, Jul 09, 2025 at 02:13:41PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoa...@oracle.com> [250707 01:28]:
> > Historically we've made it a uAPI requirement that mremap() may only
> > operate on a single VMA at a time.
> >
> > For instances where VMAs need to be resized, this makes sense, as it
> > becomes very difficult to determine what a user actually wants should they
> > indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> > Adjust sizes individually? Some other strategy?).
> >
> > However, in instances where a user is moving VMAs, it is restrictive to
> > disallow this.
> >
> > This is especially the case when anonymous mapping remap may or may not be
> > mergeable depending on whether VMAs have or have not been faulted due to
> > anon_vma assignment and folio index alignment with vma->vm_pgoff.
> >
> > Often this can result in surprising impact where a moved region is faulted,
> > then moved back and a user fails to observe a merge from otherwise
> > compatible, adjacent VMAs.
> >
> > This change allows such cases to work without the user having to be
> > cognizant of whether a prior mremap() move or other VMA operations has
> > resulted in VMA fragmentation.
> >
> > Having refactored mremap code to aggregate per-VMA and parameter checks, we
> > are now in a position to permit this kind of move.
> >
> > We do so by detecting if this is a move-only operation up-front, and then
> > utilising a separate code path via remap_move() rather than the ordinary
> > single-VMA path.
> >
> > There are two tasks that occur outside of the mmap write lock - userfaultfd
> > notification and population of unmapped regions of expanded VMAs should the
> > VMA be mlock()'d.
> >
> > The latter doesn't apply, as this is logic for a move only and thus no
> > expansion can take place. In the former case, we explicitly disallow
> > multi-VMA operations on uffd-armed VMAs.
> >
> > The mmap lock is never dropped in the move-only case, this only occurs on a
> > VMA shrink.
> >
> > We take care to handle cases where a VMA merge has occurred, by resetting
> > the VMA iterator in such instances.
> >
> > We needn't worry about self-merges, as in those cases we would, by
> > definition, not be spanning multiple VMAs. The overlapping range test is
> > performed on the whole range so specifically disallows this.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoa...@oracle.com>
> > ---
> >  mm/mremap.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 99 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 28e776cddc08..2e6005e1d22c 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -69,6 +69,8 @@ struct vma_remap_struct {
> >          enum mremap_type remap_type;    /* expand, shrink, etc. */
> >          bool mmap_locked;               /* Is mm currently write-locked? */
> >          unsigned long charged;          /* If VM_ACCOUNT, # pages to account. */
> > +       bool multi_vma;                 /* Is >1 VMA being moved? */
> > +       bool vma_reset;                 /* Was the VMA merged/unmap occur? */
>
> The name doesn't read well in code.  vmi_reset or reset_iter might be
> better, but I don't really mind it like this.

Yeah it is a bit odd, I agree.

> > >  };
> > >
> > >  static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > @@ -1111,6 +1113,7 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >
> >         err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
> >         vrm->vma = NULL; /* Invalidated. */
> > +       vrm->vma_reset = true;
>
> I believe the munmap() operation leaves the vmi in the correct position
> to reuse, so this is cautious that costs an extra walk of the tree.  I
> don't think it's critical to performance, but if it is we can look here.
> It would have to be passed through which might be a pain.

Yeah, I think this means we _always_ reset the VMI as you mention below,
unless MREMAP_DONTUNMAP | MREMAP_FIXED is used.

It's right to invalidate vrm->vma here, as this is the source VMA and is
now a dangling pointer.

I think the problem I was worried about here was a partial unmap causing
a split, and keep in mind we might be moving things backwards also. But I
don't think the _iterator_ should actually be invalidated by this, right?
We'd still be in the correct position.

So yeah, I'll drop this.

> >         if (err) {
> >                 /* OOM: unable to split vma, just get accounts right */
> >                 vm_acct_memory(len >> PAGE_SHIFT);
> > @@ -1181,6 +1184,7 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> >
> >         new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
> >                            &pmc.need_rmap_locks);
> > +       vrm->vma_reset = vma != vrm->vma;
> >         if (!new_vma) {
> >                 vrm_uncharge(vrm);
> >                 *new_vma_ptr = NULL;
> > @@ -1325,6 +1329,7 @@ static unsigned long shrink_vma(struct vma_remap_struct *vrm,
> >
> >         res = do_vmi_munmap(&vmi, mm, unmap_start, unmap_bytes,
> >                             vrm->uf_unmap, drop_lock);
> >         vrm->vma = NULL; /* Invalidated. */
> > +       vrm->vma_reset = true;
>
> Ditto here, lock depending..

We won't ever drop the lock in a move path, to be clear. Only on shrink,
which is disallowed for a multi-VMA move (as is expand). So this is
probably overcautious and I'll drop it.

> >         if (res)
> >                 return res;
> >
> > @@ -1362,6 +1367,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
> >         err = do_munmap(mm, vrm->new_addr, vrm->new_len,
> >                         vrm->uf_unmap_early);
> >         vrm->vma = NULL; /* Invalidated. */
> > +       vrm->vma_reset = true;
>
> Pretty sure this one is needed, regardless of passing through (and
> updating this call).

Yes, this one for sure.

> >         if (err)
> >                 return err;
> >
> > @@ -1581,6 +1587,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
> >         return false;
> >  }
> >
> > +/* Does this remap ONLY move mappings? */
> > +static bool vrm_move_only(struct vma_remap_struct *vrm)
> > +{
> > +       if (!vrm_implies_new_addr(vrm))
> > +               return false;
> > +
> > +       if (vrm->old_len != vrm->new_len)
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> >  static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> >  {
> >         struct mm_struct *mm = current->mm;
> > @@ -1644,10 +1662,29 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
> >             (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> >                 return -EINVAL;
> >
> > -       /* We can't remap across vm area boundaries */
> > +       /*
> > +        * We can't remap across the end of VMAs, as another VMA may be
> > +        * adjacent:
> > +        *
> > +        *       addr   vma->vm_end
> > +        *  |-----.----------|
> > +        *  |     .          |
> > +        *  |-----.----------|
> > +        *        .<--------->xxx>
> > +        *             old_len
> > +        *
> > +        * We also require that vma->vm_start <= addr < vma->vm_end.
> > +        */
> >         if (old_len > vma->vm_end - addr)
> >                 return -EFAULT;
> >
> > +       /*
> > +        * We can't support moving multiple uffd VMAs as notify requires mmap
> > +        * lock to be dropped.
> > +        */
> > +       if (vrm->multi_vma && userfaultfd_armed(vma))
> > +               return -EINVAL;
> > +
> >         if (new_len <= old_len)
> >                 return 0;
> >
> > @@ -1744,6 +1781,57 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> >         return 0;
> >  }
> >
> > +static unsigned long remap_move(struct vma_remap_struct *vrm)
> > +{
> > +       struct vm_area_struct *vma;
> > +       unsigned long start = vrm->addr;
> > +       unsigned long end = vrm->addr + vrm->old_len;
> > +       unsigned long new_addr = vrm->new_addr;
> > +       unsigned long prev_addr = start;
> > +       VMA_ITERATOR(vmi, current->mm, start);
> > +
> > +       /*
> > +        * When moving VMAs we allow for batched moves across multiple VMAs,
> > +        * with all VMAs in the input range [addr, addr + old_len) being moved
> > +        * (and split as necessary).
> > +        */
> > +       for_each_vma_range(vmi, vma, end) {
> > +               unsigned long addr = max(vma->vm_start, start);
> > +               unsigned long len = min(end, vma->vm_end) - addr;
> > +               unsigned long offset = addr - start;
> > +               unsigned long res;
> > +
> > +               /* Merged with self, move on. */
> > +               if (vrm->multi_vma && prev_addr == addr)
> > +                       continue;
> > +
> > +               vrm->vma = vma;
> > +               vrm->addr = addr;
> > +               vrm->new_addr = new_addr + offset;
> > +               vrm->old_len = vrm->new_len = len;
> > +
> > +               res = check_prep_vma(vrm);
> > +               if (!res)
> > +                       res = mremap_to(vrm);
> > +               if (IS_ERR_VALUE(res))
> > +                       return res;
> > +
> > +               /* mmap lock is only dropped on shrink. */
> > +               VM_WARN_ON_ONCE(!vrm->mmap_locked);
> > +               /* This is a move, no expand should occur. */
> > +               VM_WARN_ON_ONCE(vrm->populate_expand);
> > +
> > +               if (vrm->vma_reset) {
> > +                       vma_iter_reset(&vmi);
> > +                       vrm->vma_reset = false;
> > +               }
>
> What code path results in vma_reset == false here?

Yeah, that's a good point - only MREMAP_DONTUNMAP | MREMAP_FIXED will fail
to hit it, so let's drop it for the unmaps. I will test that this is all
good too.

> > +
> > +               vrm->multi_vma = true;
> > +               prev_addr = addr;
> > +       }
> > +
> > +       return new_addr;
> > +}
>
> The iterator use looks good.

Thanks!

> > +
> >  static unsigned long do_mremap(struct vma_remap_struct *vrm)
> >  {
> >         struct mm_struct *mm = current->mm;
> > @@ -1761,13 +1849,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
> >                 return -EINTR;
> >         vrm->mmap_locked = true;
> >
> > -       vrm->vma = vma_lookup(current->mm, vrm->addr);
> > -       res = check_prep_vma(vrm);
> > -       if (res)
> > -               goto out;
> > +       if (vrm_move_only(vrm)) {
> > +               res = remap_move(vrm);
> > +       } else {
> > +               vrm->vma = vma_lookup(current->mm, vrm->addr);
> > +               res = check_prep_vma(vrm);
> > +               if (res)
> > +                       goto out;
> >
> > -       /* Actually execute mremap. */
> > -       res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> > +               /* Actually execute mremap. */
> > +               res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> > +       }
> >
> >  out:
> >         failed = IS_ERR_VALUE(res);
> > --
> > 2.50.0
> >
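
As an aside, for anyone following along, this is roughly the userspace
scenario the series enables. This is an untested sketch (error handling
omitted, and the mprotect() split is just one convenient way to end up
with two adjacent, unmergeable VMAs):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);

        /* Two adjacent anon VMAs: mprotect() of the second page forces a split. */
        char *src = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mprotect(src + psz, psz, PROT_READ);

        /* Reserve a destination range for MREMAP_FIXED to land on. */
        char *dst = mmap(NULL, 2 * psz, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /*
         * Pure move spanning both VMAs (old_len == new_len). Without this
         * series the old_len check fails with -EFAULT at the first VMA
         * boundary; with it, both VMAs should be relocated to dst.
         */
        void *res = mremap(src, 2 * psz, 2 * psz,
                           MREMAP_MAYMOVE | MREMAP_FIXED, dst);

        printf("mremap: %p (%s)\n", res,
               res == MAP_FAILED ? strerror(errno) : "ok");
        return 0;
}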