On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote: > On 2026-02-02 at 20:30 +1100, Thomas Hellström > <[email protected]> wrote... > > Hi, > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote: > > > On 2026-02-02 at 08:07 +1100, Matthew Brost > > > <[email protected]> > > > wrote... > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote: > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote: > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard > > > > > > wrote: > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote: > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas > > > > > > > > Hellström > > > > > > > > wrote: > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard > > > > > > > > > wrote: > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote: > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas > > > > > > > > > > > Hellström > > > > > > > > > > > wrote: > > > > > > > > > > ... > > > > > > > > I’m not convinced the folio refcount has any bearing if > > > > > > > > we > > > > > > > > can take a > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m missing > > > > > > > > something. > > > > > > I think the point of the trylock vs. lock is that if you can't > > > immediately > > > lock the page then it's an indication the page is undergoing a > > > migration. > > > In other words there's no point waiting for the lock and then > > > trying > > > to call > > > migrate_to_ram() as the page will have already moved by the time > > > you > > > acquire > > > the lock. Of course that just means you spin faulting until the > > > page > > > finally > > > migrates. > > > > > > If I'm understanding the problem it sounds like we just want to > > > sleep > > > until the > > > migration is complete, ie. same as the migration entry path. We > > > don't > > > have a > > > device_private_entry_wait() function, but I don't think we need > > > one, > > > see below. > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644 > > > > > > --- a/mm/memory.c > > > > > > +++ b/mm/memory.c > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct > > > > > > vm_fault > > > > > > *vmf) > > > > > > vmf->page = > > > > > > softleaf_to_page(entry); > > > > > > ret = > > > > > > remove_device_exclusive_entry(vmf); > > > > > > } else if > > > > > > (softleaf_is_device_private(entry)) > > > > > > { > > > > > > + struct dev_pagemap *pgmap; > > > > > > + > > > > > > if (vmf->flags & > > > > > > FAULT_FLAG_VMA_LOCK) > > > > > > { > > > > > > /* > > > > > > * migrate_to_ram is not > > > > > > yet > > > > > > ready to operate > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct > > > > > > vm_fault > > > > > > *vmf) > > > > > > > > > > > > vmf- > > > > > > > orig_pte))) > > > > > > goto unlock; > > > > > > > > > > > > - /* > > > > > > - * Get a page reference while we > > > > > > know > > > > > > the page can't be > > > > > > - * freed. > > > > > > - */ > > > > > > - if (trylock_page(vmf->page)) { > > > > > > - struct dev_pagemap *pgmap; > > > > > > - > > > > > > - get_page(vmf->page); > > > > > > At this point we: > > > 1. Know the page needs to migrate > > > 2. Have the page locked > > > 3. Have a reference on the page > > > 4. Have the PTL locked > > > > > > Or in other words we have everything we need to install a > > > migration > > > entry, > > > so why not just do that? This thread would then proceed into > > > migrate_to_ram() > > > having already done migrate_vma_collect_pmd() for the faulting > > > page > > > and any > > > other threads would just sleep in the wait on migration entry > > > path > > > until the > > > migration is complete, avoiding the livelock problem the trylock > > > was > > > introduced > > > for in 1afaeb8293c9a. > > > > > > - Alistair > > > > > > > > > > > > There will always be a small time between when the page is locked > > and > > when we can install a migration entry. If the page only has a > > single > > mapcount, then the PTL lock is held during this time so the issue > > does > > not occur. But for multiple map-counts we need to release the PTL > > lock > > in migration to run try_to_migrate(), and before that, the migrate > > code > > is running lru_add_drain_all() and gets stuck. > > Oh right, my solution would be fine for the single mapping case but I > hadn't > fully thought through the implications of other threads accessing > this for > multiple map-counts. Agree it doesn't solve anything there (the rest > of the > threads would still spin on the trylock). > > Still we could use a similar solution for waiting on device-private > entries as > we do for migration entries. Instead of spinning on the trylock (ie. > PG_locked) > we could just wait on it to become unlocked if it's already locked. > Would > something like the below completely untested code work? (obviously > this is a bit > of hack, to do it properly you'd want to do more than just remove the > check from > migration_entry_wait)
Well I guess there could be failed migration where something is aborting the migration even after a page is locked. Also we must unlock the PTL lock before waiting otherwise we could deadlock. I believe a robust solution would be to take a folio reference and do a sleeping lock like John's example. Then to assert that a folio pin- count, not ref-count is required to pin a device-private folio. That would eliminate the problem of the refcount held while locking blocking migration. It looks like that's fully consistent with https://docs.kernel.org/core-api/pin_user_pages.html Then as general improvements we should fully unmap pages before calling lru_add_drain_all() as MBrost suggest and finally, to be more nice to the system in the common cases, add a cond_resched() to hmm_range_fault(). Thanks, Thomas > > --- > > diff --git a/mm/memory.c b/mm/memory.c > index 2a55edc48a65..3e5e205ee279 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > pte_unmap_unlock(vmf->pte, vmf- > >ptl); > pgmap = page_pgmap(vmf->page); > ret = pgmap->ops- > >migrate_to_ram(vmf); > - unlock_page(vmf->page); > put_page(vmf->page); > } else { > - pte_unmap_unlock(vmf->pte, vmf- > >ptl); > + migration_entry_wait(vma->vm_mm, > vmf->pmd, > + vmf->address); > } > } else if (softleaf_is_hwpoison(entry)) { > ret = VM_FAULT_HWPOISON; > diff --git a/mm/migrate.c b/mm/migrate.c > index 5169f9717f60..b676daf0f4e8 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm, > pmd_t *pmd, > goto out; > > entry = softleaf_from_pte(pte); > - if (!softleaf_is_migration(entry)) > - goto out; > > migration_entry_wait_on_locked(entry, ptl); > return;
