Hi Christian,

I agree that my previous approach was wrong. Sorry about that. Please let me 
clarify the problem I was seeing and how I ended up with that incorrect 
conclusion.

The original problem did not come from a synthetic THP test. I was running ROCm/PyTorch 
ML training on an AMD Radeon 780M system, and the workload frequently failed 
with asynchronous HIP kernel launch failures. The userspace error usually 
surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, 
but the kernel log showed GPU resets and KFD/MES queue eviction failures.

The relevant kernel messages I repeatedly saw were along these lines:

  MES failed to respond to msg=REMOVE_QUEUE
  MES failed to respond to msg=SUSPEND
  failed to suspend all gangs
  failed to remove hardware queue from MES
  Failed to evict queue
  Failed to evict process queues
  GPU reset begin

While trying to reduce the issue, I saw memory invalidations and THP-related 
page-table/backing-page activity driving the AMDGPU/KFD path through SVM 
eviction. On this system, the path I was looking at was roughly:

  svm_range_cpu_invalidate_pagetables()
    -> svm_range_evict()
    -> kgd2kfd_quiesce_mm()
    -> KFD process queue eviction
    -> MES REMOVE_QUEUE / SUSPEND

One thing that misled me was the XNACK-disabled path. Since the issue appeared 
on an XNACK-disabled APU, and that path requires queue eviction/quiesce when 
CPU page table invalidations affect GPU mappings, I incorrectly thought the 
backing-page change itself was something the driver had to prevent.
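
As an illustration only, the XNACK distinction described above can be sketched 
as a toy userspace model. This is not the kernel code: every toy_* name is 
hypothetical, and the rule it encodes (quiesce when XNACK is disabled or the 
range is GPU_ALWAYS_MAPPED, otherwise unmap and rely on replayable faults) is 
a simplification of what this thread describes.

```c
/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
#include <stdbool.h>

enum toy_action {
    TOY_UNMAP_AND_RETRY,  /* XNACK on: unmap, rely on replayable GPU faults */
    TOY_QUIESCE_QUEUES,   /* XNACK off or always-mapped: stop queues first */
};

struct toy_range {
    bool xnack_enabled;      /* process runs with XNACK enabled */
    bool gpu_always_mapped;  /* range must stay mapped on the GPU */
};

/* Decide what a CPU page table invalidation of a GPU-mapped range requires. */
static enum toy_action toy_on_invalidate(const struct toy_range *r)
{
    if (!r->xnack_enabled || r->gpu_always_mapped)
        return TOY_QUIESCE_QUEUES;
    return TOY_UNMAP_AND_RETRY;
}

/* In this model, only a failed quiesce escalates to a GPU reset. */
static bool toy_needs_reset(const struct toy_range *r, bool quiesce_ok)
{
    return toy_on_invalidate(r) == TOY_QUIESCE_QUEUES && !quiesce_ok;
}
```

In this model the backing-page change is never the fault. The reset only 
appears when the quiesce step itself fails, which matches the MES messages 
listed earlier.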

Another thing that misled me was that the application was not intentionally 
asking for THP behavior. From the workload’s point of view, these page 
transitions looked unrelated to the model computation. I therefore incorrectly 
assumed that userspace should not be able to change backing-page 
characteristics in a way that affects a driver mapping already registered with 
MMU interval notifiers. I now understand from the MM feedback that this is 
expected behavior, and that the notifier user must handle unmap/remap correctly.

So, stated more precisely: THP/remap is only one way to trigger the 
invalidation path. What actually fails for my workload is the AMDGPU/KFD/MES 
queue quiesce/eviction path that runs during those invalidations. When that 
fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
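
For illustration, here is a minimal userspace sketch of a CPU-side operation 
other than THP collapse that zaps page table entries and is therefore 
delivered to notifier users as an invalidation. It is not one of the 
reproducers mentioned in this thread and does not touch the GPU. The function 
name is hypothetical, and it assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous region, fault it in, then discard
 * its pages again. Returns 0 on success, -1 on failure.
 */
static int toy_trigger_cpu_invalidation(size_t len)
{
    unsigned char *buf;
    int ok;

    assert(len > 0);

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xa5, len);  /* fault in backing pages */

    /* THP hint only; an error (e.g. THP not configured) is ignored here. */
    (void)madvise(buf, len, MADV_HUGEPAGE);

    /* Discard the pages: the CPU mappings for the range are zapped. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* Private anonymous memory reads back as zero after MADV_DONTNEED. */
    ok = (buf[0] == 0 && buf[len - 1] == 0);

    munmap(buf, len);
    return ok ? 0 : -1;
}
```

Calling toy_trigger_cpu_invalidation(4 << 20) from a test program exercises a 
4 MiB range. If such a range were registered with an interval notifier, the 
MADV_DONTNEED step and the final munmap() would each invalidate it, with no 
THP involvement required.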

Please allow me to continue investigating a more appropriate fix for this 
problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid 
changing MM-core or THP policy semantics.

Regards,
Yitao
________________________________
From: Christian König <[email protected]>
Sent: June 25, 2026 8:35
To: Yitao Jiang <[email protected]>; Alex Deucher 
<[email protected]>; David Airlie <[email protected]>; Simona Vetter 
<[email protected]>; Felix Kuehling <[email protected]>; Andrew Morton 
<[email protected]>; David Hildenbrand <[email protected]>; Lorenzo 
Stoakes <[email protected]>
Cc: Zi Yan <[email protected]>; Baolin Wang <[email protected]>; Liam 
R . Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan Roberts 
<[email protected]>; Dev Jain <[email protected]>; Barry Song 
<[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka 
<[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan 
<[email protected]>; Michal Hocko <[email protected]>; Jann Horn 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] <[email protected]>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings

On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.

That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
>   - scripts/checkpatch.pl --strict --no-tree
>   - git apply --check
>   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>     originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
>   mm/mmu_notifier: let interval notifiers block THP
>   drm/amdgpu: block THP for HSA userptr notifiers
>   drm/amdkfd: block THP for non-replayable SVM ranges
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>  include/linux/huge_mm.h                 |   5 +-
>  include/linux/mmu_notifier.h            |  28 ++++
>  mm/khugepaged.c                         |   9 +-
>  mm/madvise.c                            |   3 +-
>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>  7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
