On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
> 
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
> 
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.

That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
> 
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
> 
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
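
For readers following along, the userspace side of the sequence described above can be sketched roughly as below. This is only an illustration, not one of the reproducers mentioned in the cover letter: map_region(), advise() and REGION_SIZE are hypothetical names, and the GPU registration step (which would sit between mapping the buffer and issuing the hints) is left out because it needs the ROCm userspace libraries.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Older libc headers may lack MADV_COLLAPSE (added in Linux 6.1). */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Hypothetical size: 4 MiB, enough to contain one aligned 2 MiB range on x86_64. */
#define REGION_SIZE (4UL << 20)

/*
 * Map an anonymous private region. It stands in for a buffer that
 * would later be registered with the GPU (registration is omitted).
 * Returns NULL on failure.
 */
static void *map_region(void)
{
	void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Apply one madvise() hint to the whole region.
 * Returns 0 on success, or the errno value on failure.
 */
static int advise(void *addr, int advice)
{
	return madvise(addr, REGION_SIZE, advice) == 0 ? 0 : errno;
}
```

A caller would map the region, register it with the GPU, and only then call advise(p, MADV_HUGEPAGE) or advise(p, MADV_COLLAPSE). Per the description above, with the series applied the first call would be treated as an ignored hint and the second would be expected to fail on an opted-in range; the exact return values are not stated in the cover letter, so none are assumed here.
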
> 
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
> 
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
> 
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
> 
>   - scripts/checkpatch.pl --strict --no-tree
>   - git apply --check
>   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>     originally exposed the failure on my Radeon 780M system
> 
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
> 
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
> 
> Yitao Jiang (3):
>   mm/mmu_notifier: let interval notifiers block THP
>   drm/amdgpu: block THP for HSA userptr notifiers
>   drm/amdkfd: block THP for non-replayable SVM ranges
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>  include/linux/huge_mm.h                 |   5 +-
>  include/linux/mmu_notifier.h            |  28 ++++
>  mm/khugepaged.c                         |   9 +-
>  mm/madvise.c                            |   3 +-
>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>  7 files changed, 286 insertions(+), 24 deletions(-)
> 
> 
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
