I cannot compile kfd_svm_split_hsa_copy.c because
"trace_history_replay.inc" is missing.
Alternatively, can you send the test binary? That should be enough to
triage the issue, since it is a regression as you mentioned.
Regards
Xiaogang
On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
Hi,
I would like to make sure this AMDKFD SVM regression is tracked by the
Linux regression process.
GitLab report:
https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
The regression was originally reported on 2026-01-27. It was bisected
to the same functional change that Alex Deucher's revert patch later
targeted:
448ee45353ef9fb1a34f5f26eb3f48923c6f0898
drm/amdkfd: Use huge page size to check split svm range alignment
The affected kernel line I tested identifies the same change as:
bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
Alex's revert patch:
https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
A small C/HSA reproducer is now available in the GitLab report. It does
not require PyTorch, ComfyUI, Docker, model files, or the original
workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
Single reproducer command, same binary on both kernels:
./kfd_svm_split_hsa_copy --upstream-ab
Same-machine A/B result on an RX 7600 XT:
448ee453/bf2084a7 active:
1/1 run faults with SDMA0 permission fault
GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
448ee453/bf2084a7 locally reverted:
10/10 runs complete
no ROCr memory access fault
no new GCVM/SDMA0 permission fault in dmesg
The bad fault page is inside the split tail and inside the SDMA copy range:
critical tail: [0x722429d61..0x722429dff]
copy pages: [0x722429b30..0x722429d70]
fault page: 0x722429d65
A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:
split_tail ... current_remap=0 old_remap=1 missed=1
MISSED_REMAP_CANDIDATE split=tail
no amdgpu_vm_update_ptes covering the fault page after the marker and
before the fault-side GET_ATTR
The suspected code issue is that the split-tail/head remap predicate
introduced by 448ee453/bf2084a7 can miss tails inside the final
512-page block. Since prange->last is inclusive,
ALIGN_DOWN(prange->last, 512) is the start of the final block, not an
exclusive upper bound.
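The arithmetic can be checked in isolation with a minimal userspace
sketch. It assumes the reported tail's last page (0x722429dff) is the
prange->last that the predicate sees, and it uses a hypothetical
upper-bound check of the form "tail_start < ALIGN_DOWN(last, 512)".
The helper names are illustrative, and this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 512ULL /* 512-page block size, as in the report */

/* Round x down to a multiple of a (a must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Inclusive range test: first <= page <= last. */
static bool in_range(uint64_t page, uint64_t first, uint64_t last)
{
    return page >= first && page <= last;
}

/*
 * Hypothetical stand-in for an upper-bound check that treats
 * ALIGN_DOWN(last, 512) as an exclusive limit. Assumption for
 * illustration only; not copied from the kernel source.
 */
static bool below_final_block(uint64_t tail_start, uint64_t last)
{
    return tail_start < align_down(last, PAGES_PER_BLOCK);
}

/* Plug in the page numbers quoted in the report. */
static void check_reported_values(void)
{
    const uint64_t tail_start = 0x722429d61ULL;
    const uint64_t tail_last  = 0x722429dffULL; /* assumed == prange->last */
    const uint64_t fault_page = 0x722429d65ULL;

    /* The final 512-page block starts at 0x722429c00. */
    assert(align_down(tail_last, PAGES_PER_BLOCK) == 0x722429c00ULL);
    /* The fault page lies in both the tail and the copy range. */
    assert(in_range(fault_page, tail_start, tail_last));
    assert(in_range(fault_page, 0x722429b30ULL, 0x722429d70ULL));
    /* The tail starts inside the final block, so the check rejects it. */
    assert(!below_final_block(tail_start, tail_last));
}
```

Calling check_reported_values() passes silently: with these values the
tail begins after the start of the final block, so a check of this form
would not select it for remap.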
I also sent a short follow-up to amd-gfx with the reproducer/A-B
summary and asked what original failure or workload 448ee453/bf2084a7
was intended to fix:
https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
I can resend the reproducer source and summaries directly on-list if preferred.
#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
Thanks,
Gerhard Schwanzer