If there are MES queue eviction failures, then the root cause is most likely an MES firmware problem or a bug in the driver's interaction with MES. Your application dies in the GPU reset that follows. The MMU notifier handling and the THP change are not the root cause; they merely happen to trigger the MES problem. The same thing could happen with NUMA migrations, an application forking, or an application being terminated with Ctrl+C. In all of these scenarios the driver depends on MES to preempt the user mode queues before the MMU notifier returns.

Regards,
  Felix


On 2026-06-25 09:06, Christian König wrote:
Hi Yitao,

adding Philip Yang.

Thanks for the investigation. That sounds like some kind of bug in the KFD SVM
handling; the driver should be perfectly capable of handling this.

I strongly suggest opening a bug report for ROCm that describes how to
reproduce this. Philip can probably point you to the right location for that.

Regards,
Christian.

On 6/25/26 15:01, 蒋 亦韬 wrote:
Hi Christian,

I agree that my previous approach was wrong. Sorry about that. Please let me 
clarify the problem I was seeing and how I ended up with that incorrect 
conclusion.

The original problem was not a synthetic THP test. I was running ROCm/PyTorch 
ML training on an AMD Radeon 780M system, and the workload frequently failed 
with asynchronous HIP kernel launch failures. The userspace error usually 
surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, 
but the kernel log showed GPU resets and KFD/MES queue eviction failures.

The relevant kernel messages I repeatedly saw were along these lines:

   MES failed to respond to msg=REMOVE_QUEUE
   MES failed to respond to msg=SUSPEND
   failed to suspend all gangs
   failed to remove hardware queue from MES
   Failed to evict queue
   Failed to evict process queues
   GPU reset begin
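
For anyone trying to confirm the same signature in their own logs, here is a minimal sketch of a scanner for the messages listed above. It is only an illustration: the function name `find_mes_eviction_failures` is made up for this example, and the patterns are simply the message fragments quoted above, so they may need adjusting if the exact wording differs between kernel versions.

```python
import re

# Message fragments copied from the kernel log excerpt above.
MES_FAILURE_PATTERNS = [
    r"MES failed to respond to msg=(REMOVE_QUEUE|SUSPEND)",
    r"failed to suspend all gangs",
    r"failed to remove hardware queue from MES",
    r"Failed to evict queue",
    r"Failed to evict process queues",
]
RESET_PATTERN = r"GPU reset begin"


def find_mes_eviction_failures(log_text):
    """Return (failure_lines, reset_after_failure) for a kernel log dump.

    failure_lines holds every line matching one of the MES/eviction
    patterns; reset_after_failure is True only if a GPU reset line
    appears after at least one such failure line.
    """
    failures = []
    reset_after_failure = False
    for line in log_text.splitlines():
        if any(re.search(p, line) for p in MES_FAILURE_PATTERNS):
            failures.append(line.strip())
        elif failures and re.search(RESET_PATTERN, line):
            reset_after_failure = True
    return failures, reset_after_failure
```

Feeding it the saved output of `dmesg` as a string is enough; a reset that is preceded by eviction failures matches the pattern described here, while a reset with no preceding failures points at something else.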

While trying to reduce the issue, I saw memory invalidations and THP-related 
page-table/backing-page activity driving the AMDGPU/KFD path through SVM 
eviction. On this system, the path I was looking at was roughly:

   svm_range_cpu_invalidate_pagetables()
     -> svm_range_evict()
     -> kgd2kfd_quiesce_mm()
     -> KFD process queue eviction
     -> MES REMOVE_QUEUE / SUSPEND

One thing that misled me was the XNACK-disabled path. Since the issue appeared 
on an XNACK-disabled APU, and that path requires queue eviction/quiesce when 
CPU page table invalidations affect GPU mappings, I incorrectly thought the 
backing-page change itself was something the driver had to prevent.

Another thing that misled me was that the application was not intentionally 
asking for THP behavior. From the workload’s point of view, these page 
transitions looked unrelated to the model computation. I therefore incorrectly 
assumed that userspace should not be able to change backing-page 
characteristics in a way that affects a driver mapping already registered with 
MMU interval notifiers. I now understand from the MM feedback that this is 
expected behavior, and that the notifier user must handle unmap/remap correctly.

So the more precise problem is that THP/remap is only one way to trigger the 
invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue 
quiesce/eviction path during those invalidations. When that fails, the GPU 
resets, and userspace later observes an asynchronous HIP failure.
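
To make that distinction concrete, here is a toy model of the reasoning, not driver code. Everything in it is hypothetical: `ToyMes`, `ToyDriver`, the `responsive` knob, and the trigger names are invented for illustration. It only shows that every trigger funnels into the same eviction step, so the outcome depends on whether the scheduler responds and not on which trigger caused the invalidation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMes:
    """Stand-in for the MES scheduler; `responsive` is a made-up knob."""
    responsive: bool = True

    def remove_queue(self) -> bool:
        # True models a successful REMOVE_QUEUE / SUSPEND response.
        return self.responsive


@dataclass
class ToyDriver:
    """Stand-in for the invalidate -> evict -> quiesce path."""
    mes: ToyMes
    log: list = field(default_factory=list)

    def invalidate(self, trigger: str) -> str:
        # Every trigger takes the same route into queue eviction.
        self.log.append(f"invalidate({trigger})")
        if self.mes.remove_queue():
            self.log.append("queues preempted")
            return "ok"
        self.log.append("eviction failed -> GPU reset")
        return "gpu_reset"


# Illustrative trigger names only; THP is one entry among several.
TRIGGERS = ["thp_remap", "numa_migration", "fork", "process_exit"]


def run_all(responsive: bool) -> dict:
    """Run every trigger against one driver and collect the outcomes."""
    driver = ToyDriver(ToyMes(responsive))
    return {t: driver.invalidate(t) for t in TRIGGERS}
```

With `run_all(True)` every trigger ends in `"ok"`, and with `run_all(False)` every trigger ends in `"gpu_reset"`. Removing one trigger from the list does not change the second result, which is why the fix boundary belongs in the eviction path.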

Please allow me to continue investigating a more appropriate fix for this 
problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid 
changing MM-core or THP policy semantics.

Regards,
Yitao
------------------------------------------------------------------------
*From:* Christian König <[email protected]>
*Sent:* June 25, 2026 8:35
*To:* Yitao Jiang <[email protected]>; Alex Deucher <[email protected]>; David Airlie
<[email protected]>; Simona Vetter <[email protected]>; Felix Kuehling <[email protected]>; Andrew Morton
<[email protected]>; David Hildenbrand <[email protected]>; Lorenzo Stoakes <[email protected]>
*Cc:* Zi Yan <[email protected]>; Baolin Wang <[email protected]>; Liam R. Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan
Roberts <[email protected]>; Dev Jain <[email protected]>; Barry Song <[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka
<[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan <[email protected]>; Michal Hocko <[email protected]>; Jann Horn
<[email protected]>; [email protected] <[email protected]>; [email protected] <[email protected]>;
[email protected] <[email protected]>; [email protected] <[email protected]>
*Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user
mappings

On 6/25/26 12:59, Yitao Jiang wrote:
Hi,

This series fixes a THP policy problem I found while debugging
frequent ROCm GPU failures on an AMD Radeon 780M system during ML
training.

Some AMDGPU/KFD user mappings are registered through interval
notifiers and cannot safely tolerate the backing VMA changing from base
pages to a transparent huge page after registration.

That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

Userspace can
still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
collapse the range, after the GPU mapping has been registered.

On my system this showed up as asynchronous ROCm/HIP kernel launch
failures, often reported later at a synchronization or copy point. I
expect the issue to be relevant to AMDGPU/KFD mappings on
XNACK-disabled GPUs more generally, because those mappings cannot rely
on replayable GPU faults after a CPU-side THP remap. I have validated
the failure and fix on AMD Radeon 780M / gfx1103.

Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
users can ask the MM core to keep the covered VMA range out of THP
while the notifier is active. The MM core applies VM_NOHUGEPAGE and
clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
over an active opt-in range is treated as an ignored hint, and
MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
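
For readers unfamiliar with the VM_NOHUGEPAGE state mentioned above, the sketch below shows the existing userspace way to reach it, using the standard `madvise()` hint through Python's `mmap` module. It is not part of this series and does not use the proposed MMU_INTERVAL_NOTIFIER_BLOCK_THP flag; the helper name `opt_out_of_thp` and the default mapping size are made up for this example.

```python
import mmap


def opt_out_of_thp(length: int = 4 * 1024 * 1024):
    """Map anonymous memory and ask the kernel not to back it with THP.

    Returns (mapping, applied). `applied` is False when the hint is not
    available, for example on a non-Linux platform or a kernel built
    without transparent huge page support.
    """
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_NOHUGEPAGE", None)
    if advice is None:
        return buf, False
    try:
        # On Linux this marks the covering VMA with VM_NOHUGEPAGE.
        buf.madvise(advice)
        return buf, True
    except OSError:
        return buf, False
```

The hint applies to the whole mapping here because `madvise()` is called without a start or length. An application would have to issue it before handing the range to the GPU runtime, which is the ordering problem the opt-in flag in patch 1 is meant to address from the kernel side.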

Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
current behavior.

This does not disable THP globally and does not add work to GPU
command submission or kernel launch paths. Additional work is limited
to opt-in notifier registration, opt-in notifier flag transitions, and
MADV_HUGEPAGE attempts that overlap an active opt-in range.

I tested this on top of torvalds/linux commit ab9de95c9cf9 with:

    - scripts/checkpatch.pl --strict --no-tree
    - git apply --check
    - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
      DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
    - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
      originally exposed the failure on my Radeon 780M system

The standalone reproducers depend on ROCm userspace libraries, so I
have not included them in this series. I can send them separately if
useful.

This series was prepared with assistance from OpenAI Codex (GPT-5.5).
I reviewed the resulting code and take responsibility for the
submission.

Yitao Jiang (3):
    mm/mmu_notifier: let interval notifiers block THP
    drm/amdgpu: block THP for HSA userptr notifiers
    drm/amdkfd: block THP for non-replayable SVM ranges

   drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
   drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
   include/linux/huge_mm.h                 |   5 +-
   include/linux/mmu_notifier.h            |  28 ++++
   mm/khugepaged.c                         |   9 +-
   mm/madvise.c                            |   3 +-
   mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
   7 files changed, 286 insertions(+), 24 deletions(-)


base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
--
2.53.0
