Hi Yitao, adding Philip Yang.

Thanks for the investigation. That sounds like some kind of bug in the KFD SVM handling.

The driver should be perfectly capable of handling this.

I strongly suggest opening a bug report for ROCm and describing how to reproduce this. Philip can probably point you to the right location for that.

Regards,
Christian.

On 6/25/26 15:01, 蒋 亦韬 wrote:
> Hi Christian,
>
> I agree that my previous approach was wrong. Sorry about that. Please let me
> clarify the problem I was seeing and how I ended up with that incorrect
> conclusion.
>
> The original problem was not a synthetic THP test. I was running ROCm/PyTorch
> ML training on an AMD Radeon 780M system, and the workload frequently failed
> with asynchronous HIP kernel launch failures. The userspace error usually
> surfaced later in PyTorch, for example around a copy/to_device/SetDevice
> path, but the kernel log showed GPU resets and KFD/MES queue eviction
> failures.
>
> The relevant kernel messages I repeatedly saw were along these lines:
>
>     MES failed to respond to msg=REMOVE_QUEUE
>     MES failed to respond to msg=SUSPEND
>     failed to suspend all gangs
>     failed to remove hardware queue from MES
>     Failed to evict queue
>     Failed to evict process queues
>     GPU reset begin
>
> While trying to reduce the issue, I saw memory invalidations and THP-related
> page-table/backing-page activity driving the AMDGPU/KFD path through SVM
> eviction. On this system, the path I was looking at was roughly:
>
>     svm_range_cpu_invalidate_pagetables()
>       -> svm_range_evict()
>         -> kgd2kfd_quiesce_mm()
>           -> KFD process queue eviction
>             -> MES REMOVE_QUEUE / SUSPEND
>
> One thing that misled me was the XNACK-disabled path. Since the issue
> appeared on an XNACK-disabled APU, and that path requires queue
> eviction/quiesce when CPU page table invalidations affect GPU mappings, I
> incorrectly thought the backing-page change itself was something the driver
> had to prevent.
>
> Another thing that misled me was that the application was not intentionally
> asking for THP behavior. From the workload’s point of view, these page
> transitions looked unrelated to the model computation. I therefore
> incorrectly assumed that userspace should not be able to change backing-page
> characteristics in a way that affects a driver mapping already registered
> with MMU interval notifiers. I now understand from the MM feedback that this
> is expected behavior, and that the notifier user must handle unmap/remap
> correctly.
>
> So the more precise problem is that THP/remap is only one way to trigger the
> invalidation path. What is failing for my workload is the AMDGPU/KFD/MES
> queue quiesce/eviction path during those invalidations. When that fails, the
> GPU resets, and userspace later observes an asynchronous HIP failure.
>
> Please allow me to continue investigating a more appropriate fix for this
> problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid
> changing MM-core or THP policy semantics.
>
> Regards,
> Yitao
> ------------------------------------------------------------------------
> *From:* Christian König <[email protected]>
> *Sent:* June 25, 2026 8:35
> *To:* Yitao Jiang <[email protected]>; Alex Deucher
> <[email protected]>; David Airlie <[email protected]>; Simona Vetter
> <[email protected]>; Felix Kuehling <[email protected]>; Andrew Morton
> <[email protected]>; David Hildenbrand <[email protected]>; Lorenzo
> Stoakes <[email protected]>
> *Cc:* Zi Yan <[email protected]>; Baolin Wang <[email protected]>;
> Liam R. Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan
> Roberts <[email protected]>; Dev Jain <[email protected]>; Barry Song
> <[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka
> <[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan
> <[email protected]>; Michal Hocko <[email protected]>; Jann Horn
> <[email protected]>; [email protected]
> <[email protected]>; [email protected]
> <[email protected]>; [email protected]
> <[email protected]>; [email protected] <[email protected]>
> *Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user
> mappings
>
> On 6/25/26 12:59, Yitao Jiang wrote:
>> Hi,
>>
>> This series fixes a THP policy problem I found while debugging
>> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
>> training.
>>
>> Some AMDGPU/KFD user mappings are registered through interval
>> notifiers and cannot safely tolerate the backing VMA changing from base
>> pages to a transparent huge page after registration.
>
> That's certainly not correct. This is a must have for a whole lot of use
> cases.
>
> Why exactly isn't that working for your use case?
>
> Regards,
> Christian.
>
>> Userspace can
>> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
>> collapse the range, after the GPU mapping has been registered.
>>
>> On my system this showed up as asynchronous ROCm/HIP kernel launch
>> failures, often reported later at a synchronization or copy point. I
>> expect the issue to be relevant to AMDGPU/KFD mappings on
>> XNACK-disabled GPUs more generally, because those mappings cannot rely
>> on replayable GPU faults after a CPU-side THP remap. I have validated
>> the failure and fix on AMD Radeon 780M / gfx1103.
>>
>> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
>> users can ask the MM core to keep the covered VMA range out of THP
>> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
>> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
>> over an active opt-in range is treated as an ignored hint, and
>> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>>
>> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
>> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
>> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
>> current behavior.
>>
>> This does not disable THP globally and does not add work to GPU
>> command submission or kernel launch paths. Additional work is limited
>> to opt-in notifier registration, opt-in notifier flag transitions, and
>> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>>
>> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>>
>> - scripts/checkpatch.pl --strict --no-tree
>> - git apply --check
>> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>>   DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>>   originally exposed the failure on my Radeon 780M system
>>
>> The standalone reproducers depend on ROCm userspace libraries, so I
>> have not included them in this series. I can send them separately if
>> useful.
>>
>> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
>> I reviewed the resulting code and take responsibility for the
>> submission.
>>
>> Yitao Jiang (3):
>>   mm/mmu_notifier: let interval notifiers block THP
>>   drm/amdgpu: block THP for HSA userptr notifiers
>>   drm/amdkfd: block THP for non-replayable SVM ranges
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>>  include/linux/huge_mm.h                 |   5 +-
>>  include/linux/mmu_notifier.h            |  28 ++++
>>  mm/khugepaged.c                         |   9 +-
>>  mm/madvise.c                            |   3 +-
>>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>>  7 files changed, 286 insertions(+), 24 deletions(-)
>>
>>
>> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
>> --
>> 2.53.0
>
