Hi Gerhard:

I think the cause is that the last byte address of the svm range is checked
for 2MB alignment when deciding whether huge page mapping is possible. Your
test case has an svm range that ends just one byte before an alignment boundary.

I tested your app with the attached patch applied and saw no page fault during
the SDMA operation. Please verify it.

Thanks
Xiaogang

From: Chen, Xiaogang
Sent: Wednesday, June 3, 2026 5:51 PM
To: Gerhard Schwanzer <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; Deucher, Alexander 
<[email protected]>; Yang, Philip <[email protected]>
Subject: Re: [REGRESSION] drm/amdkfd: SVM split-tail remap regression causes 
SDMA0 permission fault on RX 7600 XT


Hi Gerhard:

Thanks. I can build the app now, and I see the regression. I am triaging it.

The purpose of this patch is to remap split svm ranges (head/tail) that were
mapped with huge page mappings (PMD) but can no longer be mapped that way
after the split because the new svm ranges are not 2MB aligned. It seems the
remap decision misses the case where both the head and tail ranges come from
an original range that used huge page mappings. Will check....

Regards

Xiaogang


On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:




Hi Xiaogang,



Sorry, you are right. The source I uploaded was not self-contained; it still
referenced trace_history_replay.inc from an older local replay mode.



I uploaded a self-contained v2 source to the GitLab report:



https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c



The --upstream-ab path does not use that replay table, but the missing include
obviously broke fresh builds. The v2 source embeds the table and otherwise
preserves the same source.



I re-tested this v2 source before uploading:



   - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
   - ./kfd_svm_split_hsa_copy --help: OK
   - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
     GCVM/SDMA0/protection-fault messages in the test window
   - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
     the first kernel fault address matched the planned split-tail page



Validation summaries:



https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md

https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md



The intended triage command remains:



   ./kfd_svm_split_hsa_copy --upstream-ab



Generic build shape is:



   cc -O2 -g -Wall -Wextra -pthread \
     -I/path/to/rocm/include -L/path/to/rocm/lib \
     -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
     -lhsa-runtime64



If you still prefer a binary, please tell me the target runtime/distro. A
binary built on my NixOS system is Nix-store linked and likely not portable to
your test system.



One more thing that would help me test any replacement fix: do you know what
specific failure or workload 448ee453 was intended to fix? I would like to
avoid validating only the revert side while accidentally losing the original
fix.



Thanks for catching this, and thanks for taking a look.



Regards,

Gerhard





On 06/03/2026 Chen, Xiaogang wrote:



I cannot compile kfd_svm_split_hsa_copy.c because there is no
"trace_history_replay.inc".

Or can you send the test binary? That should be enough to triage the
issue since it is a regression as you mentioned.



Regards



Xiaogang



On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:

Hi,



I would like to make sure this AMDKFD SVM regression is tracked by the
Linux regression process.



GitLab report:



   https://gitlab.freedesktop.org/drm/amd/-/work_items/4914



The regression was originally reported on 2026-01-27. It was bisected to the
same functional change that Alex Deucher's revert patch later targeted:



   448ee45353ef9fb1a34f5f26eb3f48923c6f0898

   drm/amdkfd: Use huge page size to check split svm range alignment



The affected kernel line I tested identifies the same change as:



   bf2084a7b1d75d093b6a79df4c10142d49fbaa0e



Alex's revert patch:



https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html



A small C/HSA reproducer is now available in the GitLab report. It does not
require PyTorch, ComfyUI, Docker, model files, or the original workload. It
uses ROCr/HSA, an anonymous THP-advised host mapping, explicit KFD SVM
SET_ATTR ioctls, and an HSA SDMA D2H copy.



Single reproducer command, same binary on both kernels:



   ./kfd_svm_split_hsa_copy --upstream-ab



Same-machine A/B result on an RX 7600 XT:



   448ee453/bf2084a7 active:
     1/1 run faults with SDMA0 permission fault
     GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

   448ee453/bf2084a7 locally reverted:
     10/10 runs complete
     no ROCr memory access fault
     no new GCVM/SDMA0 permission fault in dmesg



The bad fault page is inside the split tail and inside the SDMA copy range:

   critical tail: [0x722429d61..0x722429dff]
   copy pages:    [0x722429b30..0x722429d70]
   fault page:    0x722429d65
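
As a sanity check on the page numbers above, the containment can be sketched
in C. This is only an illustration, not code from the reproducer: the struct
and helper names are hypothetical, and both bounds of each range are assumed
to be inclusive page numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a range of page numbers, both ends inclusive
 * (assumption; the report does not say whether the copy range end is
 * inclusive). */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* Return 1 if page lies inside r, else 0. */
static int range_contains(struct page_range r, uint64_t page)
{
    assert(r.first <= r.last);
    return page >= r.first && page <= r.last;
}

/* Compute the overlap of a and b into *out.
 * Returns 1 if the ranges overlap, 0 if they are disjoint. */
static int range_intersect(struct page_range a, struct page_range b,
                           struct page_range *out)
{
    uint64_t first = a.first > b.first ? a.first : b.first;
    uint64_t last = a.last < b.last ? a.last : b.last;

    if (first > last)
        return 0;
    out->first = first;
    out->last = last;
    return 1;
}
```

With the values reported above, the tail and copy ranges overlap in
[0x722429d61..0x722429d70], and the fault page 0x722429d65 falls inside that
overlap.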



A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:

   split_tail ... current_remap=0 old_remap=1 missed=1
   MISSED_REMAP_CANDIDATE split=tail
   no amdgpu_vm_update_ptes covering the fault page after the marker
   before the fault-side GET_ATTR



The suspected code issue is that the split-tail/head remap predicate introduced
by 448ee453/bf2084a7 can miss tails inside the final 512-page block. Since
prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is the start of the
final block, not an exclusive upper bound.
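
The arithmetic behind this can be sketched in C using the tail reported above.
This is only an illustration and not the kernel code: the helper names are
hypothetical, 512 pages per 2MB block assumes 4KB pages, and the tail's last
page is assumed to be the last page of the original range.

```c
#include <assert.h>
#include <stdint.h>

/* Pages per 2MB huge page, assuming 4KB base pages. */
#define PAGES_PER_HUGE 512ULL

/* Round x down to a multiple of a; a must be a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & ~(a - 1);
}

/* First page of the 512-page block that contains the inclusive last page. */
static uint64_t final_block_start(uint64_t last_inclusive)
{
    return align_down(last_inclusive, PAGES_PER_HUGE);
}

/* Exclusive end of the range: one past the inclusive last page. */
static uint64_t exclusive_end(uint64_t last_inclusive)
{
    return last_inclusive + 1;
}

/* Return 1 if page lies in the final 512-page block of a range whose
 * inclusive last page is last_inclusive, else 0. */
static int in_final_block(uint64_t page, uint64_t last_inclusive)
{
    return page >= final_block_start(last_inclusive) &&
           page <= last_inclusive;
}
```

For last = 0x722429dff, rounding down gives 0x722429c00, which is below the
tail start 0x722429d61, while last + 1 = 0x722429e00 is itself 512-page
aligned. The tail therefore sits entirely inside the final block.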



I also sent a short follow-up to amd-gfx with the reproducer/A-B summary and
asked what original failure or workload 448ee453/bf2084a7 was intended to fix:



https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html



I can resend the reproducer source and summaries directly on-list if
preferred.



#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914



Thanks,

Gerhard Schwanzer


Attachment: 0001-drm-amdkfd-Use-last-1-of-vm-range-to-check-2MB-huge-.patch