Thank you for testing and confirming.
Xiaogang
On 6/5/2026 1:41 PM, Gerhard Schwanzer wrote:
Hi Xiaogang,
Thanks. I tested your attached patch on my RX 7600 XT system.
Test setup:
- kernel 7.0.11 with 448ee453/bf2084a7 active
- local revert not applied
- your attached candidate fix applied
- same self-contained v2 reproducer source as before, unchanged sha256:
  33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27
Command:
./kfd_svm_split_hsa_copy --upstream-ab
Result:
- 10/10 runs completed successfully
- all HSA/SDMA D2H copies completed
- no ROCr memory access fault
- no new GCVM_L2_PROTECTION_FAULT_STATUS
- no SDMA0 permission fault
- no GPU page fault in the kernel log
So your patch fixes the reproducer on my system with the original reproducer unchanged.
Please feel free to add:
Tested-by: Gerhard Schwanzer <[email protected]>
Thanks,
Gerhard
On 05/06/26 at 19:59, Chen, Xiaogang wrote:
Hi Gerhard:
I think the cause is that the last byte address of the svm range is
checked for 2MB alignment when deciding whether a huge page mapping is
possible. Your test case has a vm range that ends just one byte before
the alignment boundary.
I tested your app with the attached patch and saw no page fault during
the sdma operation. Please verify it.
Thanks
Xiaogang
*From:* Chen, Xiaogang
*Sent:* Wednesday, June 3, 2026 5:51 PM
*To:* Gerhard Schwanzer <[email protected]>; [email protected]
*Cc:* [email protected]; [email protected]; Deucher,
Alexander <[email protected]>; Yang, Philip <[email protected]>
*Subject:* Re: [REGRESSION] drm/amdkfd: SVM split-tail remap
regression causes SDMA0 permission fault on RX 7600 XT
Hi Gerhard:
Thanks. I can build the app now, and I can see the regression. I am
triaging it.
The purpose of this patch is to remap split svm ranges (head/tail) that
were mapped with huge page mappings (pmd) but can no longer be mapped
with huge pages after the split because the new svm ranges are not 2MB
aligned. It seems the remap decision misses the case where both the head
and tail ranges come from an original range that used huge page
mappings. Will check.
Regards
Xiaogang
On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
Hi Xiaogang,
Sorry, you are right. The source I uploaded was not self-contained; it
still referenced trace_history_replay.inc from an older local replay mode.
I uploaded a self-contained v2 source to the GitLab report:
https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
The --upstream-ab path does not use that replay table, but the missing
include obviously broke fresh builds. The v2 source embeds the table and
otherwise preserves the same source.
I re-tested this v2 source before uploading:
- clean build from only kfd_svm_split_hsa_copy-v2.c: OK
- ./kfd_svm_split_hsa_copy --help: OK
- good/workaround kernel: --upstream-ab completed 10/10 runs, no new
GCVM/SDMA0/protection-fault messages in the test window
- broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
the first kernel fault address matched the planned split-tail page
Validation summaries:
https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
The intended triage command remains:
./kfd_svm_split_hsa_copy --upstream-ab
Generic build shape is:
cc -O2 -g -Wall -Wextra -pthread \
-I/path/to/rocm/include -L/path/to/rocm/lib \
-o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
-lhsa-runtime64
If you still prefer a binary, please tell me the target runtime/distro. A
binary built on my NixOS system is Nix-store linked and likely not
portable to your test system.
One more thing that would help me test any replacement fix: do you know
what specific failure or workload 448ee453 was intended to fix? I would
like to avoid validating only the revert side while accidentally losing
the original fix.
Thanks for catching this, and thanks for taking a look.
Regards,
Gerhard
On 06/03/2026 Chen, Xiaogang wrote:
I cannot compile kfd_svm_split_hsa_copy.c; "trace_history_replay.inc"
is missing.
Or can you send the test binary? That should be enough to triage the
issue since it is a regression as you mentioned.
Regards
Xiaogang
On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
Hi,
I would like to make sure this AMDKFD SVM regression is tracked by the
Linux regression process.
GitLab report:
https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
The regression was originally reported on 2026-01-27. It was bisected to
the same functional change that Alex Deucher's revert patch later
targeted:
448ee45353ef9fb1a34f5f26eb3f48923c6f0898
drm/amdkfd: Use huge page size to check split svm range alignment
The affected kernel line I tested identifies the same change as:
bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
Alex's revert patch:
https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
A small C/HSA reproducer is now available in the GitLab report. It does
not require PyTorch, ComfyUI, Docker, model files, or the original
workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
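For readers unfamiliar with the term, an "anonymous THP-advised host mapping" is typically created along the following lines. This is only a minimal sketch and is not taken from the reproducer source: the helper name map_thp_advised and the HUGE_2M constant are hypothetical, and the KFD SVM ioctls and the HSA copy are deliberately left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* 2 MiB: the huge page size assumed throughout this thread. */
#define HUGE_2M (2UL * 1024 * 1024)

/*
 * Hypothetical helper (not from the reproducer): map `len` bytes of
 * anonymous memory aligned to 2 MiB and advise the kernel to back it
 * with transparent huge pages. Returns NULL if the mapping fails.
 */
static void *map_thp_advised(size_t len)
{
    assert(len > 0 && len % HUGE_2M == 0);

    /* Over-allocate by one huge page so an aligned window always fits. */
    size_t span = len + HUGE_2M;
    uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round the start address up to the next 2 MiB boundary. */
    uintptr_t base = ((uintptr_t)raw + HUGE_2M - 1) & ~(uintptr_t)(HUGE_2M - 1);
    uint8_t *aligned = (uint8_t *)base;

    /* Trim the unaligned head and the unused tail of the reservation. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    size_t tail = (size_t)((raw + span) - (aligned + len));
    if (tail)
        munmap(aligned + len, tail);

    /* Advisory only: the kernel may ignore or reject the hint. */
    (void)madvise(aligned, len, MADV_HUGEPAGE);
    return aligned;
}
```

Calling map_thp_advised(4UL * 1024 * 1024) would give a 4 MiB region whose pages are eligible for huge page backing once touched. Whether they actually end up huge depends on the system's THP settings.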
Single reproducer command, same binary on both kernels:
./kfd_svm_split_hsa_copy --upstream-ab
Same-machine A/B result on an RX 7600 XT:
448ee453/bf2084a7 active:
- 1/1 run faults with SDMA0 permission fault
- GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
448ee453/bf2084a7 locally reverted:
- 10/10 runs complete
- no ROCr memory access fault
- no new GCVM/SDMA0 permission fault in dmesg
The bad fault page is inside the split tail and inside the SDMA copy
range:
critical tail: [0x722429d61..0x722429dff]
copy pages: [0x722429b30..0x722429d70]
fault page: 0x722429d65
A full ftrace/PTE run with the same C reproducer/SVM sequence also
shows:
split_tail ... current_remap=0 old_remap=1 missed=1
MISSED_REMAP_CANDIDATE split=tail
no amdgpu_vm_update_ptes covering the fault page after the marker
before the fault-side GET_ATTR
The suspected code issue is that the split-tail/head remap predicate
introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page
block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is
the start of the final block, not an exclusive upper bound.
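The arithmetic behind this can be checked in user space with the numbers from the report. The sketch below is not the kernel predicate. It assumes the reported values are page numbers (as the "512-page block" wording suggests), and align_down, final_block_start, aligned_end_exclusive, in_range and check_report_numbers are hypothetical helpers, with align_down standing in for the kernel's ALIGN_DOWN().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pages per 2 MiB block with 4 KiB pages (the "512" in the report). */
#define PAGES_PER_HUGE 512ULL

/* User-space stand-in for ALIGN_DOWN(); valid for power-of-two `a` only. */
static inline uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Start of the 512-page block that contains the inclusive last page. */
static inline uint64_t final_block_start(uint64_t last_inclusive)
{
    return align_down(last_inclusive, PAGES_PER_HUGE);
}

/* Exclusive upper bound of the fully covered 512-page blocks. */
static inline uint64_t aligned_end_exclusive(uint64_t last_inclusive)
{
    return align_down(last_inclusive + 1, PAGES_PER_HUGE);
}

/* Inclusive interval membership test. */
static inline bool in_range(uint64_t p, uint64_t first, uint64_t last)
{
    return p >= first && p <= last;
}

/* Self-check using only the values quoted in the report. */
static void check_report_numbers(void)
{
    const uint64_t tail_start = 0x722429d61ULL, tail_last = 0x722429dffULL;
    const uint64_t copy_first = 0x722429b30ULL, copy_last = 0x722429d70ULL;
    const uint64_t fault = 0x722429d65ULL;

    /* The fault page lies in both the split tail and the copy range. */
    assert(in_range(fault, tail_start, tail_last));
    assert(in_range(fault, copy_first, copy_last));

    /* Aligning the inclusive last page yields the final block's start. */
    assert(final_block_start(tail_last) == 0x722429c00ULL);
    /* Aligning last + 1 yields the exclusive bound one block later. */
    assert(aligned_end_exclusive(tail_last) == 0x722429e00ULL);

    /* The tail starts inside the final block, between the two values. */
    assert(tail_start >= final_block_start(tail_last));
    assert(tail_start < aligned_end_exclusive(tail_last));
}
```

With these numbers the tail ends exactly one page before a 512-page boundary, so any comparison that treats ALIGN_DOWN(last, 512) as the end of the aligned region places the whole tail past that "end", even though it sits inside the last fully covered block.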
I also sent a short follow-up to amd-gfx with the reproducer/A-B summary
and asked what original failure or workload 448ee453/bf2084a7 was
intended to fix:
https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
I can resend the reproducer source and summaries directly on-list if
preferred.
#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
Thanks,
Gerhard Schwanzer