Hi Yitao,

Adding Philip Yang.

Thanks for the investigation. That sounds like some kind of bug in the KFD SVM 
handling; the driver should be perfectly capable of handling this.

I strongly suggest opening a bug report for ROCm that describes how to 
reproduce this. Philip can probably point you to the right location for that.

Regards,
Christian.

On 6/25/26 15:01, 蒋 亦韬 wrote:
> Hi Christian,
> 
> I agree that my previous approach was wrong. Sorry about that. Please let me 
> clarify the problem I was seeing and how I ended up with that incorrect 
> conclusion.
> 
> The original problem was not a synthetic THP test. I was running ROCm/PyTorch 
> ML training on an AMD Radeon 780M system, and the workload frequently failed 
> with asynchronous HIP kernel launch failures. The userspace error usually 
> surfaced later in PyTorch, for example around a copy/to_device/SetDevice 
> path, but the kernel log showed GPU resets and KFD/MES queue eviction 
> failures.
> 
> The relevant kernel messages I repeatedly saw were along these lines:
> 
>   MES failed to respond to msg=REMOVE_QUEUE
>   MES failed to respond to msg=SUSPEND
>   failed to suspend all gangs
>   failed to remove hardware queue from MES
>   Failed to evict queue
>   Failed to evict process queues
>   GPU reset begin
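
For a bug report, it can help to pull exactly these lines out of a saved kernel log. The following is a minimal sketch, assuming the log has already been captured as text (for example from dmesg); the function name find_mes_failures is hypothetical, and the patterns are simply the message fragments listed above, not an exhaustive list of MES errors.

```python
# Message fragments copied from the kernel log excerpt quoted above.
MES_FAILURE_PATTERNS = [
    "MES failed to respond to msg=REMOVE_QUEUE",
    "MES failed to respond to msg=SUSPEND",
    "failed to suspend all gangs",
    "failed to remove hardware queue from MES",
    "Failed to evict queue",
    "Failed to evict process queues",
    "GPU reset begin",
]


def find_mes_failures(log_text):
    """Return (line_number, matched_pattern, line) for each matching log line."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for pattern in MES_FAILURE_PATTERNS:
            if pattern in line:
                hits.append((number, pattern, line.strip()))
                break  # report each log line at most once
    return hits
```

Passing the contents of a saved log file to find_mes_failures() returns the matching lines in order, which makes it easier to see whether the queue eviction failures precede the GPU reset.
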
> 
> While trying to reduce the issue, I saw memory invalidations and THP-related 
> page-table/backing-page activity driving the AMDGPU/KFD path through SVM 
> eviction. On this system, the path I was looking at was roughly:
> 
>   svm_range_cpu_invalidate_pagetables()
>     -> svm_range_evict()
>     -> kgd2kfd_quiesce_mm()
>     -> KFD process queue eviction
>     -> MES REMOVE_QUEUE / SUSPEND
> 
> One thing that misled me was the XNACK-disabled path. Since the issue 
> appeared on an XNACK-disabled APU, and that path requires queue 
> eviction/quiesce when CPU page table invalidations affect GPU mappings, I 
> incorrectly thought the backing-page change itself was something the driver 
> had to prevent.
> 
> Another thing that misled me was that the application was not intentionally 
> asking for THP behavior. From the workload’s point of view, these page 
> transitions looked unrelated to the model computation. I therefore 
> incorrectly assumed that userspace should not be able to change backing-page 
> characteristics in a way that affects a driver mapping already registered 
> with MMU interval notifiers. I now understand from the MM feedback that this 
> is expected behavior, and that the notifier user must handle unmap/remap 
> correctly.
> 
> So the more precise problem is that THP/remap is only one way to trigger the 
> invalidation path. What is failing for my workload is the AMDGPU/KFD/MES 
> queue quiesce/eviction path during those invalidations. When that fails, the 
> GPU resets, and userspace later observes an asynchronous HIP failure.
> 
> Please allow me to continue investigating a more appropriate fix for this 
> problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid 
> changing MM-core or THP policy semantics.
> 
> Regards,
> Yitao
> ------------------------------------------------------------------------
> *From:* Christian König <[email protected]>
> *Sent:* June 25, 2026 8:35
> *To:* Yitao Jiang <[email protected]>; Alex Deucher 
> <[email protected]>; David Airlie <[email protected]>; Simona Vetter 
> <[email protected]>; Felix Kuehling <[email protected]>; Andrew Morton 
> <[email protected]>; David Hildenbrand <[email protected]>; Lorenzo 
> Stoakes <[email protected]>
> *Cc:* Zi Yan <[email protected]>; Baolin Wang <[email protected]>; 
> Liam R . Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan 
> Roberts <[email protected]>; Dev Jain <[email protected]>; Barry Song 
> <[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka 
> <[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan 
> <[email protected]>; Michal Hocko <[email protected]>; Jann Horn 
> <[email protected]>; [email protected] 
> <[email protected]>; [email protected] 
> <[email protected]>; [email protected] 
> <[email protected]>; [email protected] <[email protected]>
> *Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user 
> mappings
>  
> On 6/25/26 12:59, Yitao Jiang wrote:
>> Hi,
>> 
>> This series fixes a THP policy problem I found while debugging
>> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
>> training.
>> 
>> Some AMDGPU/KFD user mappings are registered through interval
>> notifiers and cannot safely tolerate the backing VMA changing from base
>> pages to a transparent huge page after registration.
> 
> That's certainly not correct. This is a must-have for a whole lot of use 
> cases.
> 
> Why exactly isn't that working for your use case?
> 
> Regards,
> Christian.
> 
>> Userspace can
>> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
>> collapse the range, after the GPU mapping has been registered.
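
To illustrate the userspace side of that statement, here is a minimal sketch of applying the MADV_HUGEPAGE hint to an anonymous mapping. It uses Python's standard mmap module; the helper name request_thp and the 4 MiB default length are assumptions chosen for illustration. It does not register any GPU mapping and does not reproduce the failure; it only shows that the hint is an ordinary madvise call that can be issued at any time after the memory is mapped.

```python
import mmap


def request_thp(length=4 * 1024 * 1024):
    """Map anonymous memory and hint that it may be backed by huge pages.

    Returns True if the MADV_HUGEPAGE hint was accepted, and False if the
    platform, Python version, or kernel configuration does not support it.
    """
    if not hasattr(mmap, "MADV_HUGEPAGE"):
        return False  # constant is only exposed on Linux builds of Python
    # fileno=-1 requests an anonymous mapping not backed by any file.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        # Equivalent to madvise(addr, length, MADV_HUGEPAGE) on the mapping.
        region.madvise(mmap.MADV_HUGEPAGE)
        return True
    except OSError:
        # For example EINVAL when the kernel is built without THP support.
        return False
    finally:
        region.close()
```

Whether the kernel actually collapses the range into a huge page afterwards depends on the system THP policy and on khugepaged; the hint by itself only marks the range as eligible.
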
>> 
>> On my system this showed up as asynchronous ROCm/HIP kernel launch
>> failures, often reported later at a synchronization or copy point. I
>> expect the issue to be relevant to AMDGPU/KFD mappings on
>> XNACK-disabled GPUs more generally, because those mappings cannot rely
>> on replayable GPU faults after a CPU-side THP remap. I have validated
>> the failure and fix on AMD Radeon 780M / gfx1103.
>> 
>> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
>> users can ask the MM core to keep the covered VMA range out of THP
>> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
>> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
>> over an active opt-in range is treated as an ignored hint, and
>> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>> 
>> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
>> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
>> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
>> current behavior.
>> 
>> This does not disable THP globally and does not add work to GPU
>> command submission or kernel launch paths. Additional work is limited
>> to opt-in notifier registration, opt-in notifier flag transitions, and
>> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>> 
>> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>> 
>>   - scripts/checkpatch.pl --strict --no-tree
>>   - git apply --check
>>   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>>     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>>   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>>     originally exposed the failure on my Radeon 780M system
>> 
>> The standalone reproducers depend on ROCm userspace libraries, so I
>> have not included them in this series. I can send them separately if
>> useful.
>> 
>> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
>> I reviewed the resulting code and take responsibility for the
>> submission.
>> 
>> Yitao Jiang (3):
>>   mm/mmu_notifier: let interval notifiers block THP
>>   drm/amdgpu: block THP for HSA userptr notifiers
>>   drm/amdkfd: block THP for non-replayable SVM ranges
>> 
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>>  include/linux/huge_mm.h                 |   5 +-
>>  include/linux/mmu_notifier.h            |  28 ++++
>>  mm/khugepaged.c                         |   9 +-
>>  mm/madvise.c                            |   3 +-
>>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>>  7 files changed, 286 insertions(+), 24 deletions(-)
>> 
>> 
>> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
>> --
>> 2.53.0
> 
