On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.

That's certainly not correct. This is a must have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
>   - scripts/checkpatch.pl --strict --no-tree
>   - git apply --check
>   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>     originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
>   mm/mmu_notifier: let interval notifiers block THP
>   drm/amdgpu: block THP for HSA userptr notifiers
>   drm/amdkfd: block THP for non-replayable SVM ranges
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>  include/linux/huge_mm.h                 |   5 +-
>  include/linux/mmu_notifier.h            |  28 ++++
>  mm/khugepaged.c                         |   9 +-
>  mm/madvise.c                            |   3 +-
>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>  7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
