On 2/3/26 6:34 AM, Thomas Hellström wrote:
> If hmm_range_fault() fails a folio_trylock() in do_swap_page(),
> trying to acquire the lock of a device-private folio for migration
> to ram, the function will spin until it succeeds in grabbing the lock.
>
> However, if the process holding the lock is depending on a work
> item to be completed, which is scheduled on the same CPU as the
> spinning hmm_range_fault(), that work item might be starved and
> we end up in a livelock / starvation situation which is never
> resolved.
>
> This can happen, for example, if the process holding the
> device-private folio lock is stuck in
> migrate_device_unmap()->lru_add_drain_all().
> The lru_add_drain_all() function requires a short work item
> to run on all online cpus before it completes.
>
> A prerequisite for this to happen is:
> a) Both zone device and system memory folios are considered in
>    migrate_device_unmap(), so that there is a reason to call
>    lru_add_drain_all() for a system memory folio while a
>    folio lock is held on a zone device folio.
> b) The zone device folio has an initial mapcount > 1, which causes
>    at least one migration PTE entry insertion to be deferred to
>    try_to_migrate(), which can happen after the call to
>    lru_add_drain_all().
> c) No preemption, or voluntary preemption only.
>
> This all seems pretty unlikely to happen, but it is indeed hit by
> the "xe_exec_system_allocator" igt test.
>
> Resolve this by waiting for the folio to be unlocked if the
> folio_trylock() fails in the do_swap_page() function.
>
> Future code improvements might consider moving
> the lru_add_drain_all() call in migrate_device_unmap() to be
> called *after* all pages have migration entries inserted.
> That would also eliminate b) above.
>
> v2:
> - Instead of a cond_resched() in the hmm_range_fault() function,
>   eliminate the problem by waiting for the folio to be unlocked
>   in do_swap_page() (Alistair Popple, Andrew Morton)
> v3:
> - Add a stub migration_entry_wait_on_locked() for the
>   !CONFIG_MIGRATION case.
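
To make the dependency cycle described above concrete, here is a minimal
kernel-style sketch of the two tasks involved. It is illustrative only:
the task_a_sketch()/task_b_before()/task_b_after() helpers and the trimmed
control flow are invented for this example; only folio_lock(),
folio_trylock(), lru_add_drain_all(), pte_unmap(), pte_unmap_unlock() and
migration_entry_wait_on_locked() correspond to real calls.

#include <linux/mm.h>
#include <linux/migrate.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* Task A: holds the device-private folio lock, then needs a short work
 * item to complete on every online CPU (hypothetical helper). */
static void task_a_sketch(struct folio *device_folio)
{
        folio_lock(device_folio);
        /* migrate_device_unmap() path: */
        lru_add_drain_all();    /* flushes a work item on each CPU,
                                 * including the one where task B spins */
        /* ... the folio is only unlocked later in the migration sequence ... */
}

/* Task B, before the fix: a failed trylock drops the PTL and returns, the
 * fault is retried at once, and with no (or voluntary-only) preemption the
 * work item task A waits for never runs on this CPU. */
static vm_fault_t task_b_before(struct vm_fault *vmf, struct folio *device_folio)
{
        if (!folio_trylock(device_folio)) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return 0;       /* caller re-faults immediately -> livelock */
        }
        /* ... migrate the folio back to ram ... */
        folio_unlock(device_folio);
        return 0;
}

/* Task B, after the fix: sleep until the folio lock is released, so this
 * CPU is free to run task A's work item. */
static vm_fault_t task_b_after(struct vm_fault *vmf, struct folio *device_folio,
                               softleaf_t entry)
{
        if (!folio_trylock(device_folio)) {
                pte_unmap(vmf->pte);
                migration_entry_wait_on_locked(entry, vmf->ptl); /* drops ptl */
                return 0;
        }
        /* ... migrate the folio back to ram ... */
        folio_unlock(device_folio);
        return 0;
}

The key point is that task B now sleeps on the folio lock instead of
busy-retrying the fault, so the per-CPU work item gets CPU time,
lru_add_drain_all() returns, and task A eventually releases the folio lock.
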
>   (Kernel Test Robot)
>
> Suggested-by: Alistair Popple <[email protected]>
> Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
> Cc: Ralph Campbell <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Leon Romanovsky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Brost <[email protected]>
> Cc: John Hubbard <[email protected]>
> Cc: Alistair Popple <[email protected]>
> Cc: [email protected]
> Cc: <[email protected]>
> Signed-off-by: Thomas Hellström <[email protected]>
> Cc: <[email protected]> # v6.15+
> ---
>  include/linux/migrate.h | 6 ++++++
>  mm/memory.c             | 3 ++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 26ca00c325d9..800ec174b601 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -97,6 +97,12 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
>  	return -ENOSYS;
>  }
>  
> +static inline void migration_entry_wait_on_locked(softleaf_t entry, spinlock_t *ptl)
> +	__releases(ptl)
> +{
> +	spin_unlock(ptl);
> +}
> +
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..ed20da5570d5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4684,7 +4684,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			unlock_page(vmf->page);
>  			put_page(vmf->page);
>  		} else {
> -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> +			pte_unmap(vmf->pte);
> +			migration_entry_wait_on_locked(entry, vmf->ptl);

This is neatly done.

Reviewed-by: John Hubbard <[email protected]>

thanks,
-- 
John Hubbard

>  		}
>  	} else if (softleaf_is_hwpoison(entry)) {
>  		ret = VM_FAULT_HWPOISON;
