On 2/1/26 11:24 AM, Matthew Brost wrote:
On Sat, Jan 31, 2026 at 01:42:20PM -0800, John Hubbard wrote:
On 1/31/26 11:00 AM, Matthew Brost wrote:
On Sat, Jan 31, 2026 at 01:57:21PM +0100, Thomas Hellström wrote:
On Fri, 2026-01-30 at 19:01 -0800, John Hubbard wrote:
On 1/30/26 10:00 AM, Andrew Morton wrote:
On Fri, 30 Jan 2026 15:45:29 +0100 Thomas Hellström wrote:
...
I’m not convinced the folio refcount has any bearing if we can take a
sleeping lock in do_swap_page, but perhaps I’m missing something.

So far, I have not been able to find a problem with your proposal, so
something like this, I believe, could actually work:

I did something slightly more defensive with a refcount protection, but
this seems to work, fixes the issue raised by Thomas, and shows no
noticeable performance difference. If we go this route,
do_huge_pmd_device_private would need to be updated with the same
pattern as well - I don't have large device pages enabled in my current
test branch, so I would have to test that part out too.

diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..1e7ccc4a1a6c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                         vmf->page = softleaf_to_page(entry);
                         ret = remove_device_exclusive_entry(vmf);
                 } else if (softleaf_is_device_private(entry)) {
+                       struct dev_pagemap *pgmap;
+
                         if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                                 /*
                                  * migrate_to_ram is not yet ready to operate
@@ -4670,21 +4672,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                                                         vmf->orig_pte)))
                                 goto unlock;

-                       /*
-                        * Get a page reference while we know the page can't be
-                        * freed.
-                        */
-                       if (trylock_page(vmf->page)) {
-                               struct dev_pagemap *pgmap;
-
-                               get_page(vmf->page);
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                       pte_unmap_unlock(vmf->pte, vmf->ptl);
+                       lock_page(vmf->page);
+                       if (get_page_unless_zero(vmf->page)) {

I think this ordering has a problem: it releases the PTL before taking
a refcount. That allows another thread to free the page, and then the
lock_page() call here is a use-after-free.

That's why I reversed the order of those two operations; as a result,
the get_page_unless_zero() also becomes unnecessary.
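To make the race concrete, here is a hypothetical interleaving (a
sketch for illustration, not an observed trace):

    CPU 0 (do_swap_page)            CPU 1
    --------------------            -----
    pte_unmap_unlock()
                                    zaps the PTE and drops the last
                                    reference; the page is freed back
                                    to its pgmap
    lock_page(vmf->page)            <-- use-after-free
    get_page_unless_zero()          <-- too late, the page is gone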

                                 pgmap = page_pgmap(vmf->page);
                                 ret = pgmap->ops->migrate_to_ram(vmf);
                                 unlock_page(vmf->page);
                                 put_page(vmf->page);
                         } else {
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               unlock_page(vmf->page);
                         }
                 } else if (softleaf_is_hwpoison(entry)) {
                         ret = VM_FAULT_HWPOISON;

Here is my version, with the order reversed so that the reference is
taken before the PTL is dropped:

diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..af73430e7888 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4652,6 +4652,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                        vmf->page = softleaf_to_page(entry);
                        ret = remove_device_exclusive_entry(vmf);
                } else if (softleaf_is_device_private(entry)) {
+                       struct dev_pagemap *pgmap;
+
                        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                                /*
                                 * migrate_to_ram is not yet ready to operate
@@ -4674,18 +4676,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                         * Get a page reference while we know the page can't be
                         * freed.
                         */
-                       if (trylock_page(vmf->page)) {
-                               struct dev_pagemap *pgmap;
-
-                               get_page(vmf->page);
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
-                               pgmap = page_pgmap(vmf->page);
-                               ret = pgmap->ops->migrate_to_ram(vmf);
-                               unlock_page(vmf->page);
-                               put_page(vmf->page);
-                       } else {
-                               pte_unmap_unlock(vmf->pte, vmf->ptl);
-                       }
+                       get_page(vmf->page);
+                       pte_unmap_unlock(vmf->pte, vmf->ptl);
+                       lock_page(vmf->page);
+                       pgmap = page_pgmap(vmf->page);
+                       ret = pgmap->ops->migrate_to_ram(vmf);
+                       unlock_page(vmf->page);
+                       put_page(vmf->page);
                } else if (softleaf_is_hwpoison(entry)) {
                        ret = VM_FAULT_HWPOISON;
                } else if (softleaf_is_marker(entry)) {
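Regarding the do_huge_pmd_device_private() note above: assuming it uses
the same trylock pattern today, the analogous reordering might look
roughly like this. Just a sketch, not a tested patch - the helper name
and the assumptions about what the caller has already done are mine:

    /* needs linux/mm.h and linux/memremap.h */
    static vm_fault_t huge_device_private_fault_sketch(struct vm_fault *vmf,
                                                       spinlock_t *ptl)
    {
            vm_fault_t ret;

            /*
             * Caller has validated the PMD under ptl and set vmf->page;
             * take the reference while the entry still can't change.
             */
            get_page(vmf->page);
            spin_unlock(ptl);
            lock_page(vmf->page);   /* sleeping is safe now */
            ret = page_pgmap(vmf->page)->ops->migrate_to_ram(vmf);
            unlock_page(vmf->page);
            put_page(vmf->page);

            return ret;
    }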


But it looks like an AR (action item) for us to check how bad
lru_cache_disable() really is, and perhaps to compare it with an
unconditional lru_add_drain_all() at migration start.
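For comparison, the unconditional variant would amount to a one-liner
at the start of a driver's migration path. A sketch - the helper and
the surrounding flow are invented for illustration; only the mm APIs
are real:

    /* needs linux/swap.h and linux/migrate.h */
    static int example_migrate_range(struct migrate_vma *args)
    {
            int ret;

            /*
             * Drain the per-CPU LRU batches once up front, so that
             * transient LRU-cache references don't inflate folio
             * refcounts and cause migrate_vma_setup() to skip pages.
             */
            lru_add_drain_all();

            ret = migrate_vma_setup(args);
            if (ret)
                    return ret;

            /* ... allocate device memory and copy the pages here ... */

            migrate_vma_pages(args);
            migrate_vma_finalize(args);
            return 0;
    }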

Does anybody know who would be able to tell whether a page refcount
still should block migration (like today) or whether that could
actually be relaxed to a page pincount?

Yes, it really should block migration, see my response above: both
pincount and refcount literally mean "do not move this page".

As an aside because it might help at some point, I'm just now testing a
tiny patchset that uses:

     wait_var_event_killable(&folio->_refcount,
                             folio_ref_count(folio) <= expected)

during migration, paired with:

     wake_up_var(&folio->_refcount) in put_page().

This waits for the expected refcount instead of doing a blind, tight
retry loop during migration attempts. That lets migration succeed even
when it has to wait a long time for another caller to release a
reference.
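Packaged as helpers, the pairing would look roughly like this (a
sketch; the helper names are invented, and the actual patchset may
differ in detail):

    /* needs linux/mm.h and linux/wait_bit.h */

    /*
     * Migration side: sleep until the refcount drops to what migration
     * expects, rather than spinning and retrying. Returns -ERESTARTSYS
     * if a fatal signal arrives first.
     */
    static int folio_wait_expected_refs(struct folio *folio, int expected)
    {
            return wait_var_event_killable(&folio->_refcount,
                                           folio_ref_count(folio) <= expected);
    }

    /*
     * Release side, as it might look folded into put_page(): after
     * dropping a reference, wake any migrator sleeping on this folio's
     * refcount. This is the hot-path cost mentioned below.
     */
    static void folio_put_and_wake(struct folio *folio)
    {
            if (folio_put_testzero(folio))
                    __folio_put(folio);     /* last ref: free the folio */
            else
                    wake_up_var(&folio->_refcount);
    }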

It works well, but of course it also has a potentially serious
performance cost (which I still need to quantify), because it adds
cycles to the put_page() hot path. That's why I haven't posted it yet,
even as an RFC: it's still in the "is this even reasonable" stage, just
food for thought here.


If you post an RFC, we (Intel) can give it a try, as we have tests
that really stress migration in odd ways, and we have fairly good
metrics to catch performance issues too.


That would be wonderful, thanks! Testing is hard.

thanks,
--
John Hubbard
