Hi Xiaogang,

Sorry, you are right. The source I uploaded was not self-contained: it still
referenced trace_history_replay.inc from an older local replay mode.

I uploaded a self-contained v2 source to the GitLab report:

https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c

The --upstream-ab path does not use that replay table, but the missing
include obviously broke fresh builds. The v2 source embeds the table and is
otherwise unchanged.

I re-tested this v2 source before uploading:

   - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
   - ./kfd_svm_split_hsa_copy --help: OK
   - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
     GCVM/SDMA0/protection-fault messages in the test window
   - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
     the first kernel fault address matched the planned split-tail page

Validation summaries:

https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md

The intended triage command remains:

   ./kfd_svm_split_hsa_copy --upstream-ab

The generic build command is:

   cc -O2 -g -Wall -Wextra -pthread \
     -I/path/to/rocm/include -L/path/to/rocm/lib \
     -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
     -lhsa-runtime64

If you still prefer a binary, please tell me the target runtime/distro. A
binary built on my NixOS system is Nix-store linked and likely not portable
to your test system.

One more thing that would help me test any replacement fix: do you know what
specific failure or workload 448ee453 was intended to fix? I would like to
avoid validating only the revert side while accidentally losing the original
fix.

Thanks for catching this, and thanks for taking a look.

Regards,
Gerhard


On 06/03/2026 Chen, Xiaogang wrote:

> I cannot compile kfd_svm_split_hsa_copy.c, there is no
> "trace_history_replay.inc".
>
> Or can you  send the test binary?  That should be enough to triage the
> issue since it is a regression as you mentioned.
>
> Regards
>
> Xiaogang
>
> On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>> Hi,
>>
>> I would like to make sure this AMDKFD SVM regression is tracked by the
>> Linux regression process.
>>
>> GitLab report:
>>
>>    https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>
>> The regression was originally reported on 2026-01-27. It was bisected to
>> the same functional change that Alex Deucher's revert patch later targeted:
>>
>>    448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>    drm/amdkfd: Use huge page size to check split svm range alignment
>>
>> The affected kernel line I tested identifies the same change as:
>>
>>    bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>>
>> Alex's revert patch:
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>>
>> A small C/HSA reproducer is now available in the GitLab report. It does
>> not require PyTorch, ComfyUI, Docker, model files, or the original
>> workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
>> explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>>
>> Single reproducer command, same binary on both kernels:
>>
>>    ./kfd_svm_split_hsa_copy --upstream-ab
>>
>> Same-machine A/B result on an RX 7600 XT:
>>
>>    448ee453/bf2084a7 active:
>>      1/1 run faults with SDMA0 permission fault
>>      GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>>
>>    448ee453/bf2084a7 locally reverted:
>>      10/10 runs complete
>>      no ROCr memory access fault
>>      no new GCVM/SDMA0 permission fault in dmesg
>>
>> The bad fault page is inside the split tail and inside the SDMA copy range:
>>
>>    critical tail: [0x722429d61..0x722429dff]
>>    copy pages:    [0x722429b30..0x722429d70]
>>    fault page:    0x722429d65
>>
>> A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:
>>
>>    split_tail ... current_remap=0 old_remap=1 missed=1
>>    MISSED_REMAP_CANDIDATE split=tail
>>    no amdgpu_vm_update_ptes covering the fault page after the marker
>>    before the fault-side GET_ATTR
>>
>> The suspected code issue is that the split-tail/head remap predicate
>> introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page
>> block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is
>> the start of the final block, not an exclusive upper bound.
>>
>> I also sent a short follow-up to amd-gfx with the reproducer/A-B summary
>> and asked what original failure or workload 448ee453/bf2084a7 was intended
>> to fix:
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>>
>> I can resend the reproducer source and summaries directly on-list if
>> preferred.
>>
>> #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>> #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>
>> Thanks,
>> Gerhard Schwanzer
