Hi Xiaogang,

Sorry, you are right. The source I uploaded was not self-contained; it still
referenced trace_history_replay.inc from an older local replay mode.

I uploaded a self-contained v2 source to the GitLab report:

  https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c

The --upstream-ab path does not use that replay table, but the missing
include obviously broke fresh builds. The v2 source embeds the table and
otherwise preserves the same source.

I re-tested this v2 source before uploading:

- clean build from only kfd_svm_split_hsa_copy-v2.c: OK
- ./kfd_svm_split_hsa_copy --help: OK
- good/workaround kernel: --upstream-ab completed 10/10 runs, no new
  GCVM/SDMA0/protection-fault messages in the test window
- broken kernel: --upstream-ab reproduced the SDMA0 permission fault; the
  first kernel fault address matched the planned split-tail page

Validation summaries:

  https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
  https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md

The intended triage command remains:

  ./kfd_svm_split_hsa_copy --upstream-ab

Generic build shape is:

  cc -O2 -g -Wall -Wextra -pthread \
    -I/path/to/rocm/include -L/path/to/rocm/lib \
    -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
    -lhsa-runtime64

If you still prefer a binary, please tell me the target runtime/distro. A
binary built on my NixOS system is Nix-store linked and likely not portable
to your test system.

One more thing that would help me test any replacement fix: do you know what
specific failure or workload 448ee453 was intended to fix? I would like to
avoid validating only the revert side while accidentally losing the original
fix.

Thanks for catching this, and thanks for taking a look.

Regards,
Gerhard

On 06/03/2026 Chen, Xiaogang wrote:
> I cannot compile kfd_svm_split_hsa_copy.c, there is no
> "trace_history_replay.inc".
>
> Or can you send the test binary? That should be enough to triage the
> issue since it is a regression as you mentioned.
>
> Regards
>
> Xiaogang
>
> On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>> Hi,
>>
>> I would like to make sure this AMDKFD SVM regression is tracked by the
>> Linux regression process.
>>
>> GitLab report:
>>
>>   https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>
>> The regression was originally reported on 2026-01-27. It was bisected to
>> the same functional change that Alex Deucher's revert patch later
>> targeted:
>>
>>   448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>   drm/amdkfd: Use huge page size to check split svm range alignment
>>
>> The affected kernel line I tested identifies the same change as:
>>
>>   bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>>
>> Alex's revert patch:
>>
>>   https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>>
>> A small C/HSA reproducer is now available in the GitLab report. It does
>> not require PyTorch, ComfyUI, Docker, model files, or the original
>> workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
>> explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>>
>> Single reproducer command, same binary on both kernels:
>>
>>   ./kfd_svm_split_hsa_copy --upstream-ab
>>
>> Same-machine A/B result on an RX 7600 XT:
>>
>>   448ee453/bf2084a7 active:
>>     1/1 run faults with SDMA0 permission fault
>>     GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>>
>>   448ee453/bf2084a7 locally reverted:
>>     10/10 runs complete
>>     no ROCr memory access fault
>>     no new GCVM/SDMA0 permission fault in dmesg
>>
>> The bad fault page is inside the split tail and inside the SDMA copy
>> range:
>>
>>   critical tail: [0x722429d61..0x722429dff]
>>   copy pages:    [0x722429b30..0x722429d70]
>>   fault page:    0x722429d65
>>
>> A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:
>>
>>   split_tail ... current_remap=0 old_remap=1 missed=1
>>   MISSED_REMAP_CANDIDATE split=tail
>>   no amdgpu_vm_update_ptes covering the fault page after the marker
>>   before the fault-side GET_ATTR
>>
>> The suspected code issue is that the split-tail/head remap predicate
>> introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page
>> block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is
>> the start of the final block, not an exclusive upper bound.
>>
>> I also sent a short follow-up to amd-gfx with the reproducer/A-B summary
>> and asked what original failure or workload 448ee453/bf2084a7 was
>> intended to fix:
>>
>>   https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>>
>> I can resend the reproducer source and summaries directly on-list if
>> preferred.
>>
>> #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>> #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>
>> Thanks,
>> Gerhard Schwanzer
