On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> On 2026-02-02 at 21:41 +1100, Thomas Hellström <[email protected]> wrote...
> > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > On 2026-02-02 at 20:30 +1100, Thomas Hellström <[email protected]> wrote...
> > > > Hi,
> > > >
> > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost <[email protected]> wrote...
> > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
> > > > > > > > > > > > ...
> > > > > > > > > > I'm not convinced the folio refcount has any bearing if we can
> > > > > > > > > > take a sleeping lock in do_swap_page, but perhaps I'm missing
> > > > > > > > > > something.
> > > > >
> > > > > I think the point of the trylock vs. lock is that if you can't
> > > > > immediately lock the page then it's an indication the page is
> > > > > undergoing a migration. In other words there's no point waiting for
> > > > > the lock and then trying to call migrate_to_ram() as the page will
> > > > > have already moved by the time you acquire the lock.
> > > > > Of course that just means you spin faulting until the page finally
> > > > > migrates.
> > > > >
> > > > > If I'm understanding the problem it sounds like we just want to sleep
> > > > > until the migration is complete, ie. same as the migration entry path.
> > > > > We don't have a device_private_entry_wait() function, but I don't
> > > > > think we need one, see below.
> > > > >
> > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > --- a/mm/memory.c
> > > > > > > > +++ b/mm/memory.c
> > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > >  			vmf->page = softleaf_to_page(entry);
> > > > > > > >  			ret = remove_device_exclusive_entry(vmf);
> > > > > > > >  		} else if (softleaf_is_device_private(entry)) {
> > > > > > > > +			struct dev_pagemap *pgmap;
> > > > > > > > +
> > > > > > > >  			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > >  				/*
> > > > > > > >  				 * migrate_to_ram is not yet ready to operate
> > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > >  						vmf->orig_pte)))
> > > > > > > >  				goto unlock;
> > > > > > > >
> > > > > > > > -			/*
> > > > > > > > -			 * Get a page reference while we know the page can't be
> > > > > > > > -			 * freed.
> > > > > > > > -			 */
> > > > > > > > -			if (trylock_page(vmf->page)) {
> > > > > > > > -				struct dev_pagemap *pgmap;
> > > > > > > > -
> > > > > > > > -				get_page(vmf->page);
> > > > >
> > > > > At this point we:
> > > > > 1. Know the page needs to migrate
> > > > > 2. Have the page locked
> > > > > 3. Have a reference on the page
> > > > > 4. Have the PTL locked
> > > > >
> > > > > Or in other words we have everything we need to install a migration
> > > > > entry, so why not just do that? This thread would then proceed into
> > > > > migrate_to_ram() having already done migrate_vma_collect_pmd() for
> > > > > the faulting page and any other threads would just sleep in the wait
> > > > > on migration entry path until the migration is complete, avoiding the
> > > > > livelock problem the trylock was introduced for in 1afaeb8293c9a.
> > > > >
> > > > > - Alistair
> > > >
> > > > There will always be a small time between when the page is locked and
> > > > when we can install a migration entry. If the page only has a single
> > > > mapcount, then the PTL lock is held during this time so the issue does
> > > > not occur. But for multiple map-counts we need to release the PTL lock
> > > > in migration to run try_to_migrate(), and before that, the migrate
> > > > code is running lru_add_drain_all() and gets stuck.
> > >
> > > Oh right, my solution would be fine for the single mapping case but I
> > > hadn't fully thought through the implications of other threads accessing
> > > this for multiple map-counts. Agree it doesn't solve anything there (the
> > > rest of the threads would still spin on the trylock).
> > >
> > > Still we could use a similar solution for waiting on device-private
> > > entries as we do for migration entries. Instead of spinning on the
> > > trylock (ie. PG_locked) we could just wait on it to become unlocked if
> > > it's already locked. Would something like the below completely untested
> > > code work?
> > > (Obviously this is a bit of a hack; to do it properly you'd want to do
> > > more than just remove the check from migration_entry_wait.)
> >
> > Well I guess there could be failed migration where something is aborting
> > the migration even after a page is locked. Also we must unlock the PTL
> > lock before waiting otherwise we could deadlock.
>
> Yes, this is exactly what the migration entry wait code does. And if
> there's a failed migration, no problem, you just retry. That's not a
> deadlock unless the migration never succeeds, and then you're stuffed
> anyway.
>
> > I believe a robust solution would be to take a folio reference and do a
> > sleeping lock like John's example. Then to assert that a folio pin-count,
> > not ref-count, is required to pin a device-private folio. That would
> > eliminate the problem of the refcount held while locking blocking
> > migration. It looks like that's fully consistent with
>
> Waiting on a migration entry like in my example below is exactly the same
> as sleeping on the page lock other than it just waits for the page to be
> unlocked rather than trying to lock it.
>
> Internally migration_entry_wait_on_locked() is just an open-coded version
> of folio_lock() which deals with dropping the PTL, and that works without
> a page refcount.
>
> So I don't understand how this solution isn't robust? It requires no
> funniness with refcounts and works practically the same as a sleeping
> lock.
You're right. I didn't look closely enough into what
migration_entry_wait_on_locked() did. Sorry about that. That would indeed
fix the problem as well. Then the only remaining argument for the
get-a-reference-and-lock solution would be that it's not starvation-prone
in the same way, but that's definitely a problem I think we could live
with for now. I'll give this code a test.

BTW, that removal of unlock_page() isn't intentional, right?

Thanks,
Thomas

>
> - Alistair
>
> > https://docs.kernel.org/core-api/pin_user_pages.html
> >
> > Then as general improvements we should fully unmap pages before calling
> > lru_add_drain_all() as MBrost suggests and finally, to be more nice to
> > the system in the common cases, add a cond_resched() to
> > hmm_range_fault().
> >
> > Thanks,
> > Thomas
> >
> > >
> > > ---
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 2a55edc48a65..3e5e205ee279 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >  				pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >  				pgmap = page_pgmap(vmf->page);
> > >  				ret = pgmap->ops->migrate_to_ram(vmf);
> > > -				unlock_page(vmf->page);
> > >  				put_page(vmf->page);
> > >  			} else {
> > > -				pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > +				migration_entry_wait(vma->vm_mm, vmf->pmd,
> > > +						     vmf->address);
> > >  			}
> > >  		} else if (softleaf_is_hwpoison(entry)) {
> > >  			ret = VM_FAULT_HWPOISON;
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 5169f9717f60..b676daf0f4e8 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> > >  		goto out;
> > >
> > >  	entry = softleaf_from_pte(pte);
> > > -	if (!softleaf_is_migration(entry))
> > > -		goto out;
> > >
> > >  	migration_entry_wait_on_locked(entry, ptl);
> > >  	return;
