On Mon, 2026-02-02 at 23:26 +1100, Alistair Popple wrote:
> On 2026-02-02 at 22:44 +1100, Thomas Hellström
> <[email protected]> wrote...
> > On Mon, 2026-02-02 at 22:22 +1100, Alistair Popple wrote:
> > > On 2026-02-02 at 21:41 +1100, Thomas Hellström
> > > <[email protected]> wrote...
> > > > On Mon, 2026-02-02 at 21:25 +1100, Alistair Popple wrote:
> > > > > On 2026-02-02 at 20:30 +1100, Thomas Hellström
> > > > > <[email protected]> wrote...
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, 2026-02-02 at 11:10 +1100, Alistair Popple wrote:
> > > > > > > On 2026-02-02 at 08:07 +1100, Matthew Brost
> > > > > > > <[email protected]>
> > > > > > > wrote...
> > > > > > > > On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard
> > > > > > > > wrote:
> > > > > > > > > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > > > > > > > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John
> > > > > > > > > > Hubbard
> > > > > > > > > > wrote:
> > > > > > > > > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > > > > > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > > Hellström
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Fri, 2026-01-30 at 19:01 -0800, John
> > > > > > > > > > > > > Hubbard
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas
> > > > > > > > > > > > > > > Hellström
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > I’m not convinced the folio refcount has any
> > > > > > > > > > > > bearing if
> > > > > > > > > > > > we
> > > > > > > > > > > > can take a
> > > > > > > > > > > > sleeping lock in do_swap_page, but perhaps I’m
> > > > > > > > > > > > missing
> > > > > > > > > > > > something.
> > > > > > > 
> > > > > > > I think the point of the trylock vs. lock is that if you
> > > > > > > can't
> > > > > > > immediately
> > > > > > > lock the page then it's an indication the page is
> > > > > > > undergoing
> > > > > > > a
> > > > > > > migration.
> > > > > > > In other words there's no point waiting for the lock and
> > > > > > > then
> > > > > > > trying
> > > > > > > to call
> > > > > > > migrate_to_ram() as the page will have already moved by
> > > > > > > the
> > > > > > > time
> > > > > > > you
> > > > > > > acquire
> > > > > > > the lock. Of course that just means you spin faulting
> > > > > > > until
> > > > > > > the
> > > > > > > page
> > > > > > > finally
> > > > > > > migrates.
> > > > > > > 
> > > > > > > If I'm understanding the problem it sounds like we just
> > > > > > > want
> > > > > > > to
> > > > > > > sleep
> > > > > > > until the
> > > > > > > migration is complete, ie. same as the migration entry
> > > > > > > path.
> > > > > > > We
> > > > > > > don't
> > > > > > > have a
> > > > > > > device_private_entry_wait() function, but I don't think
> > > > > > > we
> > > > > > > need
> > > > > > > one,
> > > > > > > see below.
> > > > > > > 
> > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > > > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > > > > > > > > --- a/mm/memory.c
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -4652,6 +4652,8 @@ vm_fault_t
> > > > > > > > > > do_swap_page(struct
> > > > > > > > > > vm_fault
> > > > > > > > > > *vmf)
> > > > > > > > > >                          vmf->page =
> > > > > > > > > > softleaf_to_page(entry);
> > > > > > > > > >                          ret =
> > > > > > > > > > remove_device_exclusive_entry(vmf);
> > > > > > > > > >                  } else if
> > > > > > > > > > (softleaf_is_device_private(entry))
> > > > > > > > > > {
> > > > > > > > > > +                       struct dev_pagemap *pgmap;
> > > > > > > > > > +
> > > > > > > > > >                          if (vmf->flags &
> > > > > > > > > > FAULT_FLAG_VMA_LOCK)
> > > > > > > > > > {
> > > > > > > > > >                                  /*
> > > > > > > > > >                                   * migrate_to_ram
> > > > > > > > > > is
> > > > > > > > > > not
> > > > > > > > > > yet
> > > > > > > > > > ready to operate
> > > > > > > > > > @@ -4670,21 +4672,15 @@ vm_fault_t
> > > > > > > > > > do_swap_page(struct
> > > > > > > > > > vm_fault
> > > > > > > > > > *vmf)
> > > > > > > > > >                                                    
> > > > > > > > > >     
> > > > > > > > > >  
> > > > > > > > > > vmf-
> > > > > > > > > > > orig_pte)))
> > > > > > > > > >                                  goto unlock;
> > > > > > > > > > 
> > > > > > > > > > -                       /*
> > > > > > > > > > -                        * Get a page reference
> > > > > > > > > > while
> > > > > > > > > > we
> > > > > > > > > > know
> > > > > > > > > > the page can't be
> > > > > > > > > > -                        * freed.
> > > > > > > > > > -                        */
> > > > > > > > > > -                       if (trylock_page(vmf-
> > > > > > > > > > >page)) {
> > > > > > > > > > -                               struct dev_pagemap
> > > > > > > > > > *pgmap;
> > > > > > > > > > -
> > > > > > > > > > -                               get_page(vmf-
> > > > > > > > > > >page);
> > > > > > > 
> > > > > > > At this point we:
> > > > > > > 1. Know the page needs to migrate
> > > > > > > 2. Have the page locked
> > > > > > > 3. Have a reference on the page
> > > > > > > 4. Have the PTL locked
> > > > > > > 
> > > > > > > Or in other words we have everything we need to install a
> > > > > > > migration
> > > > > > > entry,
> > > > > > > so why not just do that? This thread would then proceed
> > > > > > > into
> > > > > > > migrate_to_ram()
> > > > > > > having already done migrate_vma_collect_pmd() for the
> > > > > > > faulting
> > > > > > > page
> > > > > > > and any
> > > > > > > other threads would just sleep in the wait on migration
> > > > > > > entry
> > > > > > > path
> > > > > > > until the
> > > > > > > migration is complete, avoiding the livelock problem the
> > > > > > > trylock
> > > > > > > was
> > > > > > > introduced
> > > > > > > for in 1afaeb8293c9a.
> > > > > > > 
> > > > > > >  - Alistair
> > > > > > > 
> > > > > > > > > 
> > > > > > 
> > > > > > There will always be a small time between when the page is
> > > > > > locked
> > > > > > and
> > > > > > when we can install a migration entry. If the page only has
> > > > > > a
> > > > > > single
> > > > > > mapcount, then the PTL lock is held during this time so the
> > > > > > issue
> > > > > > does
> > > > > > not occur. But for multiple map-counts we need to release
> > > > > > the
> > > > > > PTL
> > > > > > lock
> > > > > > in migration to run try_to_migrate(), and before that, the
> > > > > > migrate
> > > > > > code
> > > > > > is running lru_add_drain_all() and gets stuck.
> > > > > 
> > > > > Oh right, my solution would be fine for the single mapping
> > > > > case
> > > > > but I
> > > > > hadn't
> > > > > fully thought through the implications of other threads
> > > > > accessing
> > > > > this for
> > > > > multiple map-counts. Agree it doesn't solve anything there
> > > > > (the
> > > > > rest
> > > > > of the
> > > > > threads would still spin on the trylock).
> > > > > 
> > > > > Still we could use a similar solution for waiting on device-
> > > > > private
> > > > > entries as
> > > > > we do for migration entries. Instead of spinning on the
> > > > > trylock
> > > > > (ie.
> > > > > PG_locked)
> > > > > we could just wait on it to become unlocked if it's already
> > > > > locked.
> > > > > Would
> > > > > something like the below completely untested code work?
> > > > > (obviously
> > > > > this is a bit
> > > > > of hack, to do it properly you'd want to do more than just
> > > > > remove
> > > > > the
> > > > > check from
> > > > > migration_entry_wait)
> > > > 
> > > > Well I guess there could be failed migration where something is
> > > > aborting the migration even after a page is locked. Also we
> > > > must
> > > > unlock
> > > > the PTL lock before waiting otherwise we could deadlock.
> > > 
> > > Yes, this is exactly what the migration entry wait code does. And
> > > if
> > > there's a
> > > failed migration, no problem, you just retry. That's not a
> > > deadlock
> > > unless the
> > > migration never succeeds and then your stuffed anyway.
> > > 
> > > > I believe a robust solution would be to take a folio reference
> > > > and
> > > > do a
> > > > sleeping lock like John's example. Then to assert that a folio
> > > > pin-
> > > > count, not ref-count is required to pin a device-private folio.
> > > > That
> > > > would eliminate the problem of the refcount held while locking
> > > > blocking
> > > > migration. It looks like that's fully consistent with 
> > > 
> > > Waiting on a migration entry like in my example below is exactly
> > > the
> > > same as
> > > sleeping on the page lock other than it just waits for the page
> > > to be
> > > unlocked
> > > rather than trying to lock it.
> > > 
> > > Internally migration_entry_wait_on_locked() is just an open-coded
> > > version
> > > of folio_lock() which deals with dropping the PTL and that works
> > > without a page
> > > refcount.
> > > 
> > > So I don't understand how this solution isn't robust? It requires
> > > no
> > > funniness
> > > with refcounts and works practically the same as a sleeping lock.
> > 
> > You're right. I didn't look closely enough into what the
> > migration_entry_wait_on_locked() did. Sorry about that.
> 
> No worries. I'm somewhat familiar with it from updating it
> specifically so it
> wouldn't take a page reference as we used to have similar live-
> lock/starvation
> issues in that path too.
> 
> > That would indeed fix the problem as well. Then the only argument
> > remaining for the get-a-reference-and-lock solution would be it's
> > not
> > starvation prone in the same way. But that's definitely a problem I
> > think we could live with for now.
> 
> I don't follow how this solution would be any more starvation prone
> than getting
> a reference and locking - here the winning fault takes the lock and
> any other
> faulting threads would just wait until it was released before
> returning from
> the fault handler assuming it had been handled. But it's been a while
> since I've
> thought about all the scenarios here so maybe I missed one.

My thinking is that it would be if theoretical racing lock-holders
don't migrate to system, we can't *guarantee* migration will ever
happen. Although admittedly this is very unlikely to happen. If we
instead locked the page we'd on the other hand need to walk the page
table again to check whether the pte content was still valid....


> 
> > I'll give this code a test. BTW that removal of unlock_page() isn't
> > intentional, right? 
> 
> Thanks. And you're right, that was unintentional. Serves me for
> responding too
> late at night :-)

So I ended up with this:




diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..84b6019eac6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                                unlock_page(vmf->page);
                                put_page(vmf->page);
                        } else {
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               pte_unmap(vmf->pte);
+                               migration_entry_wait_on_locked(entry,
vmf->ptl);
                        }
                } else if (softleaf_is_hwpoison(entry)) {
                        ret = VM_FAULT_HWPOISON;
-- 
2.52.0


Seems to be a working fix.

/Thomas


> 
>  - Alistair
> 
> > Thanks,
> > Thomas
> > 
> > 
> > > 
> > >  - Alistair
> > > 
> > > > https://docs.kernel.org/core-api/pin_user_pages.html
> > > > 
> > > > Then as general improvements we should fully unmap pages before
> > > > calling
> > > > lru_add_drain_all() as MBrost suggest and finally, to be more
> > > > nice
> > > > to
> > > > the system in the common cases, add a cond_resched() to
> > > > hmm_range_fault().
> > > > 
> > > > Thanks,
> > > > Thomas
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > ---
> > > > > 
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 2a55edc48a65..3e5e205ee279 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -4678,10 +4678,10 @@ vm_fault_t do_swap_page(struct
> > > > > vm_fault
> > > > > *vmf)
> > > > >                               pte_unmap_unlock(vmf->pte,
> > > > > vmf-
> > > > > > ptl);
> > > > >                               pgmap = page_pgmap(vmf-
> > > > > >page);
> > > > >                               ret = pgmap->ops-
> > > > > > migrate_to_ram(vmf);
> > > > > -                             unlock_page(vmf->page);
> > > > >                               put_page(vmf->page);
> > > > >                       } else {
> > > > > -                             pte_unmap_unlock(vmf->pte,
> > > > > vmf-
> > > > > > ptl);
> > > > > +                             migration_entry_wait(vma-
> > > > > >vm_mm,
> > > > > vmf->pmd,
> > > > > +                                                  vmf-
> > > > > > address);
> > > > >                       }
> > > > >               } else if (softleaf_is_hwpoison(entry)) {
> > > > >                       ret = VM_FAULT_HWPOISON;
> > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > index 5169f9717f60..b676daf0f4e8 100644
> > > > > --- a/mm/migrate.c
> > > > > +++ b/mm/migrate.c
> > > > > @@ -496,8 +496,6 @@ void migration_entry_wait(struct
> > > > > mm_struct
> > > > > *mm,
> > > > > pmd_t *pmd,
> > > > >               goto out;
> > > > >  
> > > > >       entry = softleaf_from_pte(pte);
> > > > > -     if (!softleaf_is_migration(entry))
> > > > > -             goto out;
> > > > >  
> > > > >       migration_entry_wait_on_locked(entry, ptl);
> > > > >       return;

Reply via email to