On Tue, May 05, 2026 at 09:44:16PM +0300, [email protected] wrote:
> From: Mika Penttilä <[email protected]>
> 
> Currently, the way device page faulting and migration work
> is not optimal if you want to do both fault handling and
> migration at once.
> 
> Migrating non-present pages (or pages mapped with incorrect
> permissions, e.g. COW) to the GPU requires one of the
> following sequences:
> 
> 1. hmm_range_fault() - fault in non-present pages with correct
>    permissions, etc.
> 2. migrate_vma_*() - migrate the pages
> 
> Or:
> 
> 1. migrate_vma_*() - migrate present pages
> 2. If non-present pages detected by migrate_vma_*():
>    a) call hmm_range_fault() to fault pages in
>    b) call migrate_vma_*() again to migrate now present pages
> 
> The problem with the first sequence is that you always have to do two
> page walks, even though most of the time the pages are present or
> zero-page mappings, so the common case takes a performance hit.
> 
> The second sequence is better for the common case, but far worse if
> pages aren't present because now you have to walk the page tables three
> times (once to find the page is not present, once so hmm_range_fault()
> can find a non-present page to fault in and once again to set up the
> migration). It is also tricky to code correctly. One page table walk
> can cost over 1000 CPU cycles on x86-64, which is a significant hit.
> 
> We should be able to walk the page table once, faulting
> pages in as required and replacing them with migration entries if
> requested.
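
To make the walk counts above concrete, here is a toy userspace model in C.
It is only a sketch: struct toy_range and the toy_* functions are made up
for illustration and are not kernel APIs, and a "walk" is simply one pass
over an array of present bits.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: present[i] says whether page i is mapped; walks counts passes. */
struct toy_range {
	bool *present;
	size_t npages;
	unsigned int walks;
};

/*
 * One pass over the range. With fault_in set, non-present pages are made
 * present during the pass. Returns true if no page was left non-present.
 */
static bool toy_walk(struct toy_range *r, bool fault_in)
{
	bool all_present = true;

	r->walks++;
	for (size_t i = 0; i < r->npages; i++) {
		if (r->present[i])
			continue;
		if (fault_in)
			r->present[i] = true;
		else
			all_present = false;
	}
	return all_present;
}

/* First sequence: fault walk, then migrate walk. Always two walks. */
static unsigned int toy_fault_then_migrate(struct toy_range *r)
{
	toy_walk(r, true);
	toy_walk(r, false);
	return r->walks;
}

/* Second sequence: migrate walk first; fault and retry only if needed. */
static unsigned int toy_migrate_then_fault(struct toy_range *r)
{
	if (!toy_walk(r, false)) {
		toy_walk(r, true);
		toy_walk(r, false);
	}
	return r->walks;
}

/* Proposed behaviour: one walk that faults in and sets up migration. */
static unsigned int toy_fault_and_migrate(struct toy_range *r)
{
	toy_walk(r, true);
	return r->walks;
}
```

In this model the first sequence always costs two walks, the second costs
one walk when everything is present and three otherwise, and the combined
variant costs one walk in both cases.
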
> 
> Add a new flag to the HMM APIs, HMM_PFN_REQ_MIGRATE, which tells
> HMM to also prepare for migration during fault handling.
> For the migrate_vma_setup() call paths, add a flag, MIGRATE_VMA_FAULT,
> which tells the migrate path to also do fault handling.
> 
> One extra benefit of migrating via the hmm_range_fault() path
> is that migrate_vma.vma gets populated, so there is no need to
> retrieve it separately.
> 
> Tested in an x86-64 VM with the HMM test device; the selftests pass.
> For performance, the migrate throughput tests from the selftests
> show similar numbers (within error margin) to an unmodified kernel.
> Also tested rebased on the
> "Remove device private pages from physical address space" series:
> https://lore.kernel.org/linux-mm/[email protected]/
> plus a small adjustment patch, with no problems.
> 
> Changes v9-v10
>   - Fix for issue Intel CI found, forgotten pte_unmap() before
>     migration_entry_wait()
> 
> Changes v8-v9
>   - rebase on drm-tip
>   - fixed UAF around migrate_vma_split_folio() usage
>   - added missing pmd unlock
> 
> Changes v7-v8
>   - rebase on 7.0
>   - fixed subject in two patches
>   - enhanced commit messages
>   - squashed patch 6 into patch 4 to fix kernel test robot warning
>   - re-added dropped Cc block from cover letter
>   - fixed white space
> 
> Changes v6-v7
>   - rebase on 7.0.0-rc6
>   - added documentation and comments
>   - denote a to-be-migrated zero page with HMM_PFN_MIGRATE alone
>   - got rid of HMM_PFN_INOUT_FLAGS movement in patch 2
>   - picked up Acked-by from David for patch 1
>   
> Changes v5-v6
>   - rebase on 7.0.0-rc4
>   - use range based TLB flushing while unmapping ptes
>   - gate migration behind HMM_PFN_REQ_MIGRATE for fault and
>     migrate paths
>   - always infer migration flags from migrate->flags only
> 
> Changes v4-v5
>   - rebase on 6.19
>   - fixed David's email address
>   - fixed link issue without CONFIG_TRANSPARENT_HUGEPAGE
>   - refactored into smaller commits
>   - added more comments to code
> 
> Changes v3-v4:
>   - rebase on 6.19-rc8
>   - fixed issues found by kernel test robot with random configs
>   - fixed typos
> 
> Changes v2-v3:
>   - rebase on 6.19-rc7
>   - fixed issues found by kernel test robot
>   - fixed smatch issues reported by Dan Carpenter <[email protected]>
>   - fixes to lock handling (pmd/pte) on errors
>   - added assertions for pmd/pte lock states
>   - other issues discovered by Matthew, thanks!
> 
> Changes v1-v2:
>   - rebase on 6.19-rc6
>   - fixed issues found by kernel test robot
>   - fixed locking (pmd/ptl) to cover handle_ and prepare_ regions
>     parts if migrating
>   - other issues discovered by Matthew, thanks!
> 
> Changes RFC-v1:
>   - rebase on 6.19-rc5
>   - adjust for the device THP
>   - changes from feedback
> 
> Revisions:
>   - RFC: https://lore.kernel.org/linux-mm/[email protected]/
>   - v1: https://lore.kernel.org/all/[email protected]/
>   - v2: https://lore.kernel.org/all/[email protected]/
>   - v3: https://lore.kernel.org/all/[email protected]/
>   - v4: https://lore.kernel.org/all/[email protected]/
>   - v5: https://lore.kernel.org/linux-mm/[email protected]/
>   - v6: https://lore.kernel.org/linux-mm/[email protected]/
>   - v7: https://lore.kernel.org/linux-mm/[email protected]/
>   - v8: https://lore.kernel.org/linux-mm/[email protected]/
>   - v9: https://lore.kernel.org/linux-mm/[email protected]/
> 
> Cc: David Hildenbrand <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Leon Romanovsky <[email protected]>
> Cc: Alistair Popple <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Zi Yan <[email protected]>
> Cc: Matthew Brost <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Lorenzo Stoakes <[email protected]>
> Cc: "Liam R. Howlett" <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> 
> Mika Penttilä (5):
>   mm/Kconfig: changes for migrate on fault for device pages
>   mm: Add helper to convert HMM pfn to migrate pfn
>   mm/hmm: do the plumbing for HMM to participate in migration
>   mm: setup device page migration in HMM pagewalk
>   lib/test_hmm: add a new testcase for the migrate on fault
> 
>  include/linux/hmm.h                    |  19 +-
>  include/linux/migrate.h                |  26 +-
>  lib/test_hmm.c                         | 101 ++-
>  lib/test_hmm_uapi.h                    |  19 +-
>  mm/Kconfig                             |   2 +
>  mm/hmm.c                               | 836 +++++++++++++++++++++++--
>  mm/migrate_device.c                    | 583 +++--------------
>  tools/testing/selftests/mm/hmm-tests.c |  54 ++
>  8 files changed, 1067 insertions(+), 573 deletions(-)
> 
> drm-tip
> base-commit: 94d56a898a2db27f841b17f6966a81ba502fe63c
> -- 

FYI: while testing with hmm-tests, I ran into the following lockdep warning:

[  107.866004] ============================================
[  107.866284] WARNING: possible recursive locking detected
[  107.866577] 7.1.0-rc3-00311-g4277273ca0e1 #12 Not tainted
[  107.866877] --------------------------------------------
[  107.867217] hmm-tests/1098 is trying to acquire lock:
[  107.867491] ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_range_fault+0x147/0x610 [test_hmm] <- line 368 of lib/test_hmm.c
[  107.868076] 
[  107.868076] but task is already holding lock:
[  107.868383] ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_fault_and_migrate_to_device.constprop.0+0x3aa/0x6a0 [test_hmm] <- line 1267 of lib/test_hmm.c
[  107.869076] 
[  107.869076] other info that might help us debug this:
[  107.869415]  Possible unsafe locking scenario:
[  107.869415] 
[  107.869729]        CPU0
[  107.869866]        ----
[  107.870054]   lock(&mm->mmap_lock);
[  107.870247]   lock(&mm->mmap_lock);
[  107.870436] 
[  107.870436]  *** DEADLOCK ***
[  107.870436] 
[  107.870743]  May be due to missing lock nesting notation
[  107.870743] 
[  107.871158] 1 lock held by hmm-tests/1098:
[  107.871377]  #0: ffff888113571b38 (&mm->mmap_lock){++++}-{4:4}, at: dmirror_fault_and_migrate_to_device.constprop.0+0x3aa/0x6a0 [test_hmm]
[  107.872081] 
[  107.872081] stack backtrace:
[  107.872348] CPU: 1 UID: 0 PID: 1098 Comm: hmm-tests Not tainted 7.1.0-rc3-00311-g4277273ca0e1 #12 PREEMPT(full)
[  107.872350] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20260213-6.fc44 02/13/2026
[  107.872354] Call Trace:
[  107.872357]  <TASK>
[  107.872358]  dump_stack_lvl+0x5d/0x80
[  107.872385]  print_deadlock_bug.cold+0xc0/0xe2
[  107.872393]  __lock_acquire+0x10cf/0x1b90
[  107.872400]  lock_acquire+0x189/0x2f0
[  107.872401]  ? dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872404]  down_read+0x9b/0x4b0
[  107.872420]  ? dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872421]  ? lock_acquire+0x189/0x2f0
[  107.872422]  ? __pfx_down_read+0x10/0x10
[  107.872424]  ? __lock_acquire+0x3c2/0x1b90
[  107.872425]  dmirror_range_fault+0x147/0x610 [test_hmm]
[  107.872427]  ? __pfx_down_read+0x10/0x10
[  107.872429]  ? __pfx_dmirror_range_fault+0x10/0x10 [test_hmm]
[  107.872430]  ? __lock_acquire+0x3c2/0x1b90
[  107.872434]  dmirror_fault_and_migrate_to_device.constprop.0+0x3bf/0x6a0 [test_hmm]
[  107.872436]  ? __pfx_dmirror_fault_and_migrate_to_device.constprop.0+0x10/0x10 [test_hmm]
[  107.872439]  ? find_held_lock+0x2b/0x80
[  107.872444]  ? dmirror_device_remove_chunks+0x5b8/0xa00 [test_hmm]
[  107.872445]  ? __is_insn_slot_addr+0xee/0x1f0
[  107.872458]  ? lock_acquire+0x189/0x2f0
[  107.872460]  ? avc_has_extended_perms+0x234/0x1350
[  107.872476]  ? __might_fault+0x89/0x150
[  107.872484]  ? lock_release+0xe1/0x320
[  107.872486]  dmirror_fops_unlocked_ioctl+0x9ba/0xdb0 [test_hmm]
[  107.872488]  ? ioctl_has_perm.constprop.0.isra.0+0x2fe/0x6c0
[  107.872494]  ? __pfx_dmirror_fops_unlocked_ioctl+0x10/0x10 [test_hmm]
[  107.872498]  ? count_memcg_events_mm.constprop.0+0x22/0x1a0
[  107.872499]  ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
[  107.872501]  ? count_memcg_events_mm.constprop.0+0xaa/0x1a0
[  107.872503]  ? lock_release+0xe1/0x320
[  107.872504]  ? find_held_lock+0x2b/0x80
[  107.872506]  ? exc_page_fault+0x7e/0xf0
[  107.872510]  __x64_sys_ioctl+0x13c/0x1d0
[  107.872521]  ? lockdep_hardirqs_on_prepare+0xd9/0x190
[  107.872523]  do_syscall_64+0xf3/0x6a0
[  107.872526]  ? exc_page_fault+0xde/0xf0
[  107.872528]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  107.872529] RIP: 0033:0x7f7381c543ad
[  107.872531] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[  107.872532] RSP: 002b:00007ffc3160a9b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  107.872539] RAX: ffffffffffffffda RBX: 00007f7381b44000 RCX: 00007f7381c543ad
[  107.872540] RDX: 00007ffc3160aa30 RSI: 00000000c0284803 RDI: 0000000000000022
[  107.872541] RBP: 00007ffc3160aa00 R08: 00000000ffffffff R09: 0000000000000000
[  107.872541] R10: 0000000000000022 R11: 0000000000000246 R12: 00007ffc3160aa24
[  107.872542] R13: 000000000041f380 R14: 0000000000000200 R15: 00007f7381200000
[  107.872544]  </TASK>
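
Going by the two annotated frames above, dmirror_fault_and_migrate_to_device()
appears to hold mmap_lock for read when it calls dmirror_range_fault(), which
then takes it for read again. A nested down_read() on the same rwsem can
deadlock if a writer queues up between the two acquisitions. One common way
to avoid this is to split out a helper that expects the lock to be held
already. Below is a minimal userspace sketch of that pattern; struct toy_mm
and all toy_* names are hypothetical stand-ins, and this is not a proposed
patch for lib/test_hmm.c.

```c
#include <assert.h>

/*
 * Toy stand-in for mm->mmap_lock. It only tracks how deeply the current
 * task holds the lock for read, which is what the splat is about.
 */
struct toy_mm {
	int read_depth;
	int max_read_depth;
};

static void toy_read_lock(struct toy_mm *mm)
{
	mm->read_depth++;
	if (mm->read_depth > mm->max_read_depth)
		mm->max_read_depth = mm->read_depth;
}

static void toy_read_unlock(struct toy_mm *mm)
{
	assert(mm->read_depth > 0);
	mm->read_depth--;
}

/* Inner helper: does the work and requires the caller to hold the lock. */
static int toy_range_fault_locked(struct toy_mm *mm)
{
	assert(mm->read_depth > 0);
	return 0;
}

/* Wrapper for callers that do not hold the lock yet. */
static int toy_range_fault(struct toy_mm *mm)
{
	int ret;

	toy_read_lock(mm);
	ret = toy_range_fault_locked(mm);
	toy_read_unlock(mm);
	return ret;
}

/* Shape seen in the splat: lock held, then the locking wrapper is called. */
static int toy_fault_and_migrate_nested(struct toy_mm *mm)
{
	int ret;

	toy_read_lock(mm);
	ret = toy_range_fault(mm);	/* second read acquisition */
	toy_read_unlock(mm);
	return ret;
}

/* Alternative shape: lock held once, inner helper called directly. */
static int toy_fault_and_migrate_flat(struct toy_mm *mm)
{
	int ret;

	toy_read_lock(mm);
	ret = toy_range_fault_locked(mm);
	toy_read_unlock(mm);
	return ret;
}
```

In the toy model the nested variant reaches a read depth of two and the flat
variant stays at one. Whether a split like this or dropping the lock before
the call fits the test driver better is a separate question.
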


Thanks,
Balbir
