On 2026-02-02 at 21:41 +1100, Thomas Hellström <[email protected]> wrote... > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote: > > On 2026-02-02 at 20:30 +1100, Thomas Hellström > > <[email protected]> wrote... > > > Hi, > > > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote: > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost > > > > <[email protected]> > > > > wrote... > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote: > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote: > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard > > > > > > > wrote: > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote: > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas > > > > > > > > > Hellström > > > > > > > > > wrote: > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard > > > > > > > > > > wrote: > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote: > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas > > > > > > > > > > > > Hellström > > > > > > > > > > > > wrote: > > > > > > > > > > > ... > > > > > > > > > I’m not convinced the folio refcount has any bearing if > > > > > > > > > we > > > > > > > > > can take a > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m missing > > > > > > > > > something. > > > > > > > > I think the point of the trylock vs. lock is that if you can't > > > > immediately > > > > lock the page then it's an indication the page is undergoing a > > > > migration. > > > > In other words there's no point waiting for the lock and then > > > > trying > > > > to call > > > > migrate_to_ram() as the page will have already moved by the time > > > > you > > > > acquire > > > > the lock. Of course that just means you spin faulting until the > > > > page > > > > finally > > > > migrates. > > > > > > > > If I'm understanding the problem it sounds like we just want to > > > > sleep > > > > until the > > > > migration is complete, ie. same as the migration entry path. We > > > > don't > > > > have a > > > > device_private_entry_wait() function, but I don't think we need > > > > one, > > > > see below. > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644 > > > > > > > --- a/mm/memory.c > > > > > > > +++ b/mm/memory.c > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct > > > > > > > vm_fault > > > > > > > *vmf) > > > > > > > vmf->page = > > > > > > > softleaf_to_page(entry); > > > > > > > ret = > > > > > > > remove_device_exclusive_entry(vmf); > > > > > > > } else if > > > > > > > (softleaf_is_device_private(entry)) > > > > > > > { > > > > > > > + struct dev_pagemap *pgmap; > > > > > > > + > > > > > > > if (vmf->flags & > > > > > > > FAULT_FLAG_VMA_LOCK) > > > > > > > { > > > > > > > /* > > > > > > > * migrate_to_ram is not > > > > > > > yet > > > > > > > ready to operate > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct > > > > > > > vm_fault > > > > > > > *vmf) > > > > > > > > > > > > > > vmf- > > > > > > > > orig_pte))) > > > > > > > goto unlock; > > > > > > > > > > > > > > - /* > > > > > > > - * Get a page reference while we > > > > > > > know > > > > > > > the page can't be > > > > > > > - * freed. > > > > > > > - */ > > > > > > > - if (trylock_page(vmf->page)) { > > > > > > > - struct dev_pagemap *pgmap; > > > > > > > - > > > > > > > - get_page(vmf->page); > > > > > > > > At this point we: > > > > 1. Know the page needs to migrate > > > > 2. Have the page locked > > > > 3. Have a reference on the page > > > > 4. Have the PTL locked > > > > > > > > Or in other words we have everything we need to install a > > > > migration > > > > entry, > > > > so why not just do that? This thread would then proceed into > > > > migrate_to_ram() > > > > having already done migrate_vma_collect_pmd() for the faulting > > > > page > > > > and any > > > > other threads would just sleep in the wait on migration entry > > > > path > > > > until the > > > > migration is complete, avoiding the livelock problem the trylock > > > > was > > > > introduced > > > > for in 1afaeb8293c9a. > > > > > > > > - Alistair > > > > > > > > > > > > > > > > There will always be a small time between when the page is locked > > > and > > > when we can install a migration entry. If the page only has a > > > single > > > mapcount, then the PTL lock is held during this time so the issue > > > does > > > not occur. But for multiple map-counts we need to release the PTL > > > lock > > > in migration to run try_to_migrate(), and before that, the migrate > > > code > > > is running lru_add_drain_all() and gets stuck. > > > > Oh right, my solution would be fine for the single mapping case but I > > hadn't > > fully thought through the implications of other threads accessing > > this for > > multiple map-counts. Agree it doesn't solve anything there (the rest > > of the > > threads would still spin on the trylock). > > > > Still we could use a similar solution for waiting on device-private > > entries as > > we do for migration entries. Instead of spinning on the trylock (ie. > > PG_locked) > > we could just wait on it to become unlocked if it's already locked. > > Would > > something like the below completely untested code work? (obviously > > this is a bit > > of hack, to do it properly you'd want to do more than just remove the > > check from > > migration_entry_wait) > > Well I guess there could be failed migration where something is > aborting the migration even after a page is locked. Also we must unlock > the PTL lock before waiting otherwise we could deadlock.
Yes, this is exactly what the migration entry wait code does. And if there's a failed migration, no problem, you just retry. That's not a deadlock unless the migration never succeeds and then your stuffed anyway. > I believe a robust solution would be to take a folio reference and do a > sleeping lock like John's example. Then to assert that a folio pin- > count, not ref-count is required to pin a device-private folio. That > would eliminate the problem of the refcount held while locking blocking > migration. It looks like that's fully consistent with Waiting on a migration entry like in my example below is exactly the same as sleeping on the page lock other than it just waits for the page to be unlocked rather than trying to lock it. Internally migration_entry_wait_on_locked() is just an open-coded version of folio_lock() which deals with dropping the PTL and that works without a page refcount. So I don't understand how this solution isn't robust? It requires no funniness with refcounts and works practically the same as a sleeping lock. - Alistair > https://docs.kernel.org/core-api/pin_user_pages.html > > Then as general improvements we should fully unmap pages before calling > lru_add_drain_all() as MBrost suggest and finally, to be more nice to > the system in the common cases, add a cond_resched() to > hmm_range_fault(). > > Thanks, > Thomas > > > > > > > --- > > > > diff --git a/mm/memory.c b/mm/memory.c > > index 2a55edc48a65..3e5e205ee279 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > pte_unmap_unlock(vmf->pte, vmf- > > >ptl); > > pgmap = page_pgmap(vmf->page); > > ret = pgmap->ops- > > >migrate_to_ram(vmf); > > - unlock_page(vmf->page); > > put_page(vmf->page); > > } else { > > - pte_unmap_unlock(vmf->pte, vmf- > > >ptl); > > + migration_entry_wait(vma->vm_mm, > > vmf->pmd, > > + vmf->address); > > } > > } else if (softleaf_is_hwpoison(entry)) { > > ret = VM_FAULT_HWPOISON; > > diff --git a/mm/migrate.c b/mm/migrate.c > > index 5169f9717f60..b676daf0f4e8 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm, > > pmd_t *pmd, > > goto out; > > > > entry = softleaf_from_pte(pte); > > - if (!softleaf_is_migration(entry)) > > - goto out; > > > > migration_entry_wait_on_locked(entry, ptl); > > return;
