Hi Gerhard:
Thanks. I can build the app now, and I can see the regression. I am triaging it.
The purpose of this patch is to remap split svm ranges (head/tail) that
were mapped with huge page mappings (pmd) but can no longer be mapped with
huge pages after the split because the new svm ranges are not 2MB aligned. It
seems the remap decision misses the case where both the head and tail ranges
come from an original range that used huge page mappings. Will check...
Regards
Xiaogang
On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
Hi Xiaogang,
Sorry, you are right. The source I uploaded was not self-contained; it still
referenced trace_history_replay.inc from an older local replay mode.
I uploaded a self-contained v2 source to the GitLab report:
https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
The --upstream-ab path does not use that replay table, but the missing include
obviously broke fresh builds. The v2 source embeds the table and is otherwise
unchanged.
I re-tested this v2 source before uploading:
- clean build from only kfd_svm_split_hsa_copy-v2.c: OK
- ./kfd_svm_split_hsa_copy --help: OK
- good/workaround kernel: --upstream-ab completed 10/10 runs, no new
GCVM/SDMA0/protection-fault messages in the test window
- broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
the first kernel fault address matched the planned split-tail page
Validation summaries:
https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
The intended triage command remains:
./kfd_svm_split_hsa_copy --upstream-ab
Generic build shape is:
cc -O2 -g -Wall -Wextra -pthread \
-I/path/to/rocm/include -L/path/to/rocm/lib \
-o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
-lhsa-runtime64
If you still prefer a binary, please tell me the target runtime/distro. A
binary built on my NixOS system is Nix-store linked and likely not portable to
your test system.
One more thing that would help me test any replacement fix: do you know what
specific failure or workload 448ee453 was intended to fix? I would like to
avoid validating only the revert side while accidentally losing the original
fix.
Thanks for catching this, and thanks for taking a look.
Regards,
Gerhard
On 06/03/2026 Chen, Xiaogang wrote:
I cannot compile kfd_svm_split_hsa_copy.c; there is no
"trace_history_replay.inc".
Or can you send the test binary? That should be enough to triage the
issue since it is a regression as you mentioned.
Regards
Xiaogang
On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
Hi,
I would like to make sure this AMDKFD SVM regression is tracked by the
Linux regression process.
GitLab report:
https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
The regression was originally reported on 2026-01-27. It was bisected to the
same functional change that Alex Deucher's revert patch later targeted:
448ee45353ef9fb1a34f5f26eb3f48923c6f0898
drm/amdkfd: Use huge page size to check split svm range alignment
The affected kernel line I tested identifies the same change as:
bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
Alex's revert patch:
https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
A small C/HSA reproducer is now available in the GitLab report. It does not
require PyTorch, ComfyUI, Docker, model files, or the original workload. It
uses ROCr/HSA, an anonymous THP-advised host mapping, explicit KFD SVM
SET_ATTR ioctls, and an HSA SDMA D2H copy.
Single reproducer command, same binary on both kernels:
./kfd_svm_split_hsa_copy --upstream-ab
Same-machine A/B result on an RX 7600 XT:
448ee453/bf2084a7 active:
1/1 run faults with SDMA0 permission fault
GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
448ee453/bf2084a7 locally reverted:
10/10 runs complete
no ROCr memory access fault
no new GCVM/SDMA0 permission fault in dmesg
The bad fault page is inside the split tail and inside the SDMA copy range:
critical tail: [0x722429d61..0x722429dff]
copy pages: [0x722429b30..0x722429d70]
fault page: 0x722429d65
A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:
split_tail ... current_remap=0 old_remap=1 missed=1
MISSED_REMAP_CANDIDATE split=tail
no amdgpu_vm_update_ptes covering the fault page after the marker and
before the fault-side GET_ATTR
The suspected code issue is that the split-tail/head remap predicate introduced
by 448ee453/bf2084a7 can miss tails inside the final 512-page block. Since
prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is the start of the
final block, not an exclusive upper bound.
I also sent a short follow-up to amd-gfx with the reproducer/A-B summary and
asked what original failure or workload 448ee453/bf2084a7 was intended to fix:
https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
I can resend the reproducer source and summaries directly on-list if preferred.
#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
Thanks,
Gerhard Schwanzer