On 2026-02-02 at 08:07 +1100, Matthew Brost <[email protected]> wrote:
> On Sun, Feb 01, 2026 at 12:48:33PM -0800, John Hubbard wrote:
> > On 2/1/26 11:24 AM, Matthew Brost wrote:
> > > On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
> > > > On 1/31/26 11:00 AM, Matthew Brost wrote:
> > > > > On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
> > > > > > On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
> > > > > > > On 1/30/26 10:00 AM, Andrew Morton wrote:
> > > > > > > > On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
> > > > > > > ...
> > > > > I’m not convinced the folio refcount has any bearing if we can take a
> > > > > sleeping lock in do_swap_page, but perhaps I’m missing something.
I think the point of the trylock vs. lock is that if you can't immediately
lock the page, that's an indication the page is undergoing migration.
In other words, there's no point in waiting for the lock and then calling
migrate_to_ram(), because the page will have already moved by the time you
acquire the lock. Of course, that just means you spin, repeatedly faulting,
until the page finally migrates.
If I'm understanding the problem correctly, it sounds like we just want to
sleep until the migration is complete, i.e. the same as the migration entry
path. We don't have a device_private_entry_wait() function, but I don't think
we need one; see below.
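
For reference, the existing migration entry path behaves roughly like this
(paraphrased using the classic swapops helpers; the exact structure in
do_swap_page() differs between kernel versions):

	/*
	 * When the PTE holds a migration entry the faulting thread just
	 * sleeps until the migration completes, then retries the fault.
	 * No refcount games needed.
	 */
	if (is_migration_entry(entry)) {
		/*
		 * Takes the PTL to revalidate the entry, then drops it
		 * and sleeps on the migrating folio's lock; woken when
		 * the migrating thread unlocks the folio.
		 */
		migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
		goto out;
	}
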
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index da360a6eb8a4..1e7ccc4a1a6c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > vmf->page = softleaf_to_page(entry);
> > > ret = remove_device_exclusive_entry(vmf);
> > > } else if (softleaf_is_device_private(entry)) {
> > > + struct dev_pagemap *pgmap;
> > > +
> > > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > /*
> > > 				 * migrate_to_ram is not yet ready to operate
> > > @@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > vmf->orig_pte)))
> > > goto unlock;
> > >
> > > - /*
> > > -			 * Get a page reference while we know the page can't be
> > > - * freed.
> > > - */
> > > - if (trylock_page(vmf->page)) {
> > > - struct dev_pagemap *pgmap;
> > > -
> > > - get_page(vmf->page);
At this point we:
1. Know the page needs to migrate
2. Have the page locked
3. Have a reference on the page
4. Have the PTL locked
Or in other words, we have everything we need to install a migration entry,
so why not just do that? This thread would then proceed into migrate_to_ram()
having already done the equivalent of migrate_vma_collect_pmd() for the
faulting page, and any other threads would just sleep in the migration-entry
wait path until the migration is complete, avoiding the livelock problem the
trylock was introduced to solve in commit 1afaeb8293c9a.
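
Something like this untested sketch, using the classic swapops helpers (the
softleaf equivalents presumably differ in name, and soft-dirty / uffd-wp bits
are glossed over):

	/*
	 * With the PTL held, the page locked and a reference taken,
	 * replace the device-private entry with a migration entry
	 * before calling migrate_to_ram(). Concurrent faults on this
	 * PTE then sleep in migration_entry_wait() instead of spinning
	 * on trylock_page(). Both entries are non-present, so no TLB
	 * flush is needed.
	 */
	swp_entry_t mig_entry;

	if (is_writable_device_private_entry(entry))
		mig_entry = make_writable_migration_entry(
						page_to_pfn(vmf->page));
	else
		mig_entry = make_readable_migration_entry(
						page_to_pfn(vmf->page));
	set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte,
		   swp_entry_to_pte(mig_entry));
	pte_unmap_unlock(vmf->pte, vmf->ptl);

	pgmap = page_pgmap(vmf->page);
	ret = pgmap->ops->migrate_to_ram(vmf);
	unlock_page(vmf->page);
	put_page(vmf->page);

The failure path would need to tear the entry down again, and
migrate_vma_collect_pmd() would need to accept an entry that has already been
installed, of course.
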
- Alistair
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + lock_page(vmf->page);
> > > + if (get_page_unless_zero(vmf->page)) {
> >
> > I think this ordering has a problem, because it releases the PTL before
> > getting a refcount. That allows another thread to free the page, and
>
> Yes, I reasoned that this could be a problem too after thinking about it
> a bit more. The issue with taking a refcount without the lock is that
> we’re back to the livelock problem that was solved here:
>
> git format-patch -1 1afaeb8293c9a
>
> > then the lock_page() call here will be doing a use-after-free.
> >
>
> I don’t think it’s a use-after-free per se; rather, the page could have
> moved and been reallocated. If the same_pte check were moved under the
> page lock, I think it would largely solve that, but if the page were
> reallocated as a larger folio, the page lock might collide with the
> folio-order bit encoding and hang forever. This is likely extremely hard
> to hit, as you’d need multiple threads faulting the same page plus the
> driver reallocating the page as a folio at the same time, but
> nonetheless it could be a problem.
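
Right, I read the "same_pte check under the page lock" idea as something like
this untested sketch (which, as you say, still leaves the folio-order
collision on lock_page()):

	pte_unmap_unlock(vmf->pte, vmf->ptl);
	/*
	 * Hazard: no reference is held yet, so the page may have been
	 * freed and reallocated; if it became part of a large folio,
	 * lock_page() can hang on the folio-order bit encoding.
	 */
	lock_page(vmf->page);
	if (get_page_unless_zero(vmf->page)) {
		/*
		 * Recheck the PTE under the page lock: if the entry
		 * changed, the page already migrated (and was possibly
		 * reallocated), so just retry the fault.
		 */
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		if (vmf->pte &&
		    pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			pgmap = page_pgmap(vmf->page);
			ret = pgmap->ops->migrate_to_ram(vmf);
		} else if (vmf->pte) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
		}
		unlock_page(vmf->page);
		put_page(vmf->page);
	} else {
		unlock_page(vmf->page);
	}
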
>
> So maybe it's back to the drawing board...
>
> Matt
>
> > That's why I reversed the order of those, and then as a result the
> > get_page_unless_zero() also becomes unnecessary.
> >
> > > pgmap = page_pgmap(vmf->page);
> > > ret = pgmap->ops->migrate_to_ram(vmf);
> > > unlock_page(vmf->page);
> > > put_page(vmf->page);
> > > } else {
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + unlock_page(vmf->page);
> > > }
> > > } else if (softleaf_is_hwpoison(entry)) {
> > > ret = VM_FAULT_HWPOISON;
> > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index da360a6eb8a4..af73430e7888 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > vmf->page = softleaf_to_page(entry);
> > > > ret = remove_device_exclusive_entry(vmf);
> > > > } else if (softleaf_is_device_private(entry)) {
> > > > + struct dev_pagemap *pgmap;
> > > > +
> > > > if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > /*
> > > > 				 * migrate_to_ram is not yet ready to operate
> > > > @@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > 				 * Get a page reference while we know the page can't be
> > > > * freed.
> > > > */
> > > > - if (trylock_page(vmf->page)) {
> > > > - struct dev_pagemap *pgmap;
> > > > -
> > > > - get_page(vmf->page);
> > > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > - pgmap = page_pgmap(vmf->page);
> > > > - ret = pgmap->ops->migrate_to_ram(vmf);
> > > > - unlock_page(vmf->page);
> > > > - put_page(vmf->page);
> > > > - } else {
> > > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > - }
> > > > + get_page(vmf->page);
> > > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > + lock_page(vmf->page);
> > > > + pgmap = page_pgmap(vmf->page);
> > > > + ret = pgmap->ops->migrate_to_ram(vmf);
> > > > + unlock_page(vmf->page);
> > > > + put_page(vmf->page);
> > > > } else if (softleaf_is_hwpoison(entry)) {
> > > > ret = VM_FAULT_HWPOISON;
> > > > } else if (softleaf_is_marker(entry)) {
> > > >
> > > > >
> > > > > > But it looks like an AR for us to try to check how bad
> > > > > > lru_cache_disable() really is. And perhaps compare with an
> > > > > > unconditional lru_add_drain_all() at migration start.
> > > > > >
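
For the comparison, I assume something like the following, where
do_the_migration() is just a stand-in for the device migration path, not a
real function:

	/*
	 * Option A: disable the per-CPU LRU batching caches for the
	 * whole migration window. Heavy (global, and implies a drain),
	 * but no new batched references can appear mid-migration.
	 */
	lru_cache_disable();
	ret = do_the_migration();
	lru_cache_enable();

	/*
	 * Option B: a single unconditional drain up front. Cheaper,
	 * but a page can re-enter a per-CPU batch afterwards and hold
	 * an extra refcount when the migration checks it.
	 */
	lru_add_drain_all();
	ret = do_the_migration();
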
> > > > > > Does anybody know who would be able to tell whether a page refcount
> > > > > > still should block migration (like today) or whether that could
> > > > > > actually be relaxed to a page pincount?
> > > >
> > > > Yes, it really should block migration, see my response above: both
> > > > pincount and refcount literally mean "do not move this page".
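
To make the distinction concrete (simplified; the first check is roughly what
folio_migrate_mapping() does today, the second is the hypothetical relaxation
Thomas is asking about):

	/*
	 * Today: migration only proceeds if the refcount can be frozen
	 * at exactly the expected value, so any extra reference fails
	 * the attempt.
	 */
	if (!folio_ref_freeze(folio, expected_count))
		return -EAGAIN;

	/*
	 * Hypothetical: only long-term DMA pins block migration. Note
	 * that folio_maybe_dma_pinned() can return false positives at
	 * high refcounts, which is one reason this isn't a drop-in.
	 */
	if (folio_maybe_dma_pinned(folio))
		return -EBUSY;
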
> > > >
> > > > As an aside because it might help at some point, I'm just now testing a
> > > > tiny patchset that uses:
> > > >
> > > > wait_var_event_killable(&folio->_refcount,
> > > > folio_ref_count(folio) <= expected)
> > > >
> > > > during migration, paired with:
> > > >
> > > > wake_up_var(&folio->_refcount) in put_page().
> > > >
> > > > This waits for the expected refcount, instead of doing a blind, tight
> > > > retry loop during migration attempts. This lets migration succeed even
> > > > when waiting a long time for another caller to release a refcount.
> > > >
> > > > It works well, but of course, it also has a potentially serious
> > > > performance cost (which I need to quantify), because it adds cycles to
> > > > the put_page() hot path. Which is why I haven't posted it yet, even as
> > > > an RFC. It's still in the "is this even reasonable" stage, just food
> > > > for thought here.
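
My guess at how the pairing fits together (a reconstruction of the idea, not
the actual patchset; both helper names here are made up):

	/*
	 * Migration side: sleep until the refcount drops to the
	 * expected value instead of retrying in a loop; returns
	 * -ERESTARTSYS if interrupted by a fatal signal.
	 */
	static int folio_wait_for_expected_refs(struct folio *folio,
						int expected)
	{
		return wait_var_event_killable(&folio->_refcount,
				folio_ref_count(folio) <= expected);
	}

	/*
	 * Release side: every put that doesn't free the folio now pays
	 * for a wake_up_var() hashed-waitqueue lookup, which is the
	 * hot-path cost mentioned above.
	 */
	static inline void folio_put_notify(struct folio *folio)
	{
		if (folio_put_testzero(folio))
			__folio_put(folio);
		else
			wake_up_var(&folio->_refcount);
	}
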
> > > >
> > >
> > > If you post an RFC, we (Intel) can give it a try, as we have tests that
> > > really stress migration in odd ways, and we have fairly good metrics to
> > > catch perf issues too.
> > >
> >
> > That would be wonderful, thanks! Testing is hard.
> >
> > thanks,
> > --
> > John Hubbard
> >
>