On 2026-02-02 at 22:44 +1100, Thomas Hellström 
<[email protected]> wrote...
> On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> > On 2026-02-02 at 21:41 +1100, Thomas Hellström
> > <[email protected]> wrote...
> > > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > > <[email protected]> wrote...
> > > > > Hi,
> > > > > 
> > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > > <[email protected]>
> > > > > > wrote...
> > > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > > wrote:
> > > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard
> > > > > > > > > wrote:
> > > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
> > > > > > > > > > > > > ...
> > > > > > > > > > > I’m not convinced the folio refcount has any bearing if
> > > > > > > > > > > we can take a sleeping lock in do_swap_page, but perhaps
> > > > > > > > > > > I’m missing something.
> > > > > > 
> > > > > > I think the point of the trylock vs. lock is that if you can't
> > > > > > immediately lock the page then it's an indication the page is
> > > > > > undergoing a migration. In other words there's no point waiting for
> > > > > > the lock and then trying to call migrate_to_ram() as the page will
> > > > > > have already moved by the time you acquire the lock. Of course that
> > > > > > just means you spin faulting until the page finally migrates.
> > > > > > 
> > > > > > If I'm understanding the problem it sounds like we just want to
> > > > > > sleep until the migration is complete, ie. the same as the
> > > > > > migration entry path. We don't have a device_private_entry_wait()
> > > > > > function, but I don't think we need one, see below.
> > > > > > 
> > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > > --- a/mm/memory.c
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >                          vmf->page = softleaf_to_page(entry);
> > > > > > > > >                          ret = remove_device_exclusive_entry(vmf);
> > > > > > > > >                  } else if (softleaf_is_device_private(entry)) {
> > > > > > > > > +                       struct dev_pagemap *pgmap;
> > > > > > > > > +
> > > > > > > > >                          if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > >                                  /*
> > > > > > > > >                                   * migrate_to_ram is not yet ready to operate
> > > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >                                                         vmf->orig_pte)))
> > > > > > > > >                                  goto unlock;
> > > > > > > > > 
> > > > > > > > > -                       /*
> > > > > > > > > -                        * Get a page reference while we know the page can't be
> > > > > > > > > -                        * freed.
> > > > > > > > > -                        */
> > > > > > > > > -                       if (trylock_page(vmf->page)) {
> > > > > > > > > -                               struct dev_pagemap *pgmap;
> > > > > > > > > -
> > > > > > > > > -                               get_page(vmf->page);
> > > > > > 
> > > > > > At this point we:
> > > > > > 1. Know the page needs to migrate
> > > > > > 2. Have the page locked
> > > > > > 3. Have a reference on the page
> > > > > > 4. Have the PTL locked
> > > > > > 
> > > > > > Or in other words we have everything we need to install a migration
> > > > > > entry, so why not just do that? This thread would then proceed into
> > > > > > migrate_to_ram() having already done migrate_vma_collect_pmd() for
> > > > > > the faulting page, and any other threads would just sleep in the
> > > > > > wait-on-migration-entry path until the migration is complete,
> > > > > > avoiding the livelock problem the trylock was introduced for in
> > > > > > 1afaeb8293c9a.
> > > > > > 
> > > > > >  - Alistair
> > > > > > 
> > > > > > > > 
> > > > > 
> > > > > There will always be a short window between when the page is locked
> > > > > and when we can install a migration entry. If the page only has a
> > > > > single mapcount, then the PTL is held during this time, so the issue
> > > > > does not occur. But for multiple mapcounts we need to release the PTL
> > > > > in migration to run try_to_migrate(), and before that, the migrate
> > > > > code is running lru_add_drain_all() and gets stuck.
> > > > 
> > > > Oh right, my solution would be fine for the single mapping case
> > > > but I
> > > > hadn't
> > > > fully thought through the implications of other threads accessing
> > > > this for
> > > > multiple map-counts. Agree it doesn't solve anything there (the
> > > > rest
> > > > of the
> > > > threads would still spin on the trylock).
> > > > 
> > > > Still we could use a similar solution for waiting on device-
> > > > private
> > > > entries as
> > > > we do for migration entries. Instead of spinning on the trylock
> > > > (ie.
> > > > PG_locked)
> > > > we could just wait on it to become unlocked if it's already
> > > > locked.
> > > > Would
> > > > something like the below completely untested code work?
> > > > (obviously
> > > > this is a bit
> > > > of hack, to do it properly you'd want to do more than just remove
> > > > the
> > > > check from
> > > > migration_entry_wait)
> > > 
> > > Well, I guess there could be a failed migration where something aborts
> > > the migration even after a page is locked. Also, we must unlock the PTL
> > > before waiting, otherwise we could deadlock.
> > 
> > Yes, this is exactly what the migration entry wait code does. And if
> > there's a failed migration, no problem, you just retry. That's not a
> > deadlock unless the migration never succeeds, and then you're stuffed
> > anyway.
> > 
> > > I believe a robust solution would be to take a folio reference and do
> > > a sleeping lock like John's example, then assert that a folio
> > > pin-count, not ref-count, is required to pin a device-private folio.
> > > That would eliminate the problem of the refcount held while locking
> > > blocking migration. It looks like that's fully consistent with
> > 
> > Waiting on a migration entry like in my example below is exactly the
> > same as sleeping on the page lock, other than it just waits for the page
> > to be unlocked rather than trying to lock it.
> > 
> > Internally migration_entry_wait_on_locked() is just an open-coded
> > version of folio_lock() which deals with dropping the PTL, and that
> > works without a page refcount.
> > 
> > So I don't understand how this solution isn't robust? It requires no
> > funniness with refcounts and works practically the same as a sleeping
> > lock.
> 
> You're right. I didn't look closely enough at what
> migration_entry_wait_on_locked() did. Sorry about that.

No worries. I'm somewhat familiar with it from updating it specifically so it
wouldn't take a page reference, as we used to have similar livelock/starvation
issues in that path too.

> That would indeed fix the problem as well. Then the only remaining
> argument for the get-a-reference-and-lock solution would be that it's not
> starvation prone in the same way. But that's definitely a problem I think
> we could live with for now.

I don't follow how this solution would be any more starvation prone than
getting a reference and locking - here the winning fault takes the lock, and
any other faulting threads would just wait until it was released before
returning from the fault handler, assuming it had been handled. But it's been
a while since I've thought about all the scenarios here, so maybe I missed
one.

> I'll give this code a test. BTW that removal of unlock_page() isn't
> intentional, right? 

Thanks. And you're right, that was unintentional. Serves me right for
responding too late at night :-)
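With the unlock_page() retained, that hunk would presumably read as follows
(a sketch only, against the same context as the patch quoted below; untested):

```diff
@@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                                pte_unmap_unlock(vmf->pte, vmf->ptl);
                                pgmap = page_pgmap(vmf->page);
                                ret = pgmap->ops->migrate_to_ram(vmf);
                                unlock_page(vmf->page);
                                put_page(vmf->page);
                        } else {
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               migration_entry_wait(vma->vm_mm, vmf->pmd,
+                                                    vmf->address);
                        }
```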

 - Alistair

> Thanks,
> Thomas
> 
> 
> > 
> >  - Alistair
> > 
> > > https://docs.kernel.org/core-api/pin_user_pages.html
> > > 
> > > Then, as general improvements, we should fully unmap pages before
> > > calling lru_add_drain_all() as MBrost suggests and, finally, to be
> > > nicer to the system in the common cases, add a cond_resched() to
> > > hmm_range_fault().
> > > 
> > > Thanks,
> > > Thomas
> > > 
> > > 
> > > 
> > > > 
> > > > ---
> > > > 
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 2a55edc48a65..3e5e205ee279 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >                                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > >                                 pgmap = page_pgmap(vmf->page);
> > > >                                 ret = pgmap->ops->migrate_to_ram(vmf);
> > > > -                               unlock_page(vmf->page);
> > > >                                 put_page(vmf->page);
> > > >                         } else {
> > > > -                               pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > +                               migration_entry_wait(vma->vm_mm, vmf->pmd,
> > > > +                                                    vmf->address);
> > > >                         }
> > > >                 } else if (softleaf_is_hwpoison(entry)) {
> > > >                         ret = VM_FAULT_HWPOISON;
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 5169f9717f60..b676daf0f4e8 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> > > >                 goto out;
> > > >  
> > > >         entry = softleaf_from_pte(pte);
> > > > -       if (!softleaf_is_migration(entry))
> > > > -               goto out;
> > > >  
> > > >         migration_entry_wait_on_locked(entry, ptl);
> > > >         return;
