Hi Xiaogang,

Thanks. I tested your attached patch on my RX 7600 XT system.

Test setup:
- kernel 7.0.11 with 448ee453/bf2084a7 active
- local revert not applied
- your attached candidate fix applied
- same self-contained v2 reproducer source as before, unchanged
  sha256: 33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27

Command:

    ./kfd_svm_split_hsa_copy --upstream-ab

Result:
- 10/10 runs completed successfully
- all HSA/SDMA D2H copies completed
- no ROCr memory access fault
- no new GCVM_L2_PROTECTION_FAULT_STATUS
- no SDMA0 permission fault
- no GPU page fault in the kernel log

So your patch fixes the reproducer on my system with the original
reproducer unchanged.

Please feel free to add:

Tested-by: Gerhard Schwanzer [email protected]

Thanks,
Gerhard


On 05/06/26 at 19:59, Chen, Xiaogang wrote:
>
> Hi Gerhard:
>
> I think the cause is that the last byte address of the svm range is
> checked for 2MB alignment when deciding whether a huge page mapping is
> possible. Your test case has a vm range that ends just one byte before
> the alignment boundary.
>
> I tested your app with the attached patch and saw no page fault during
> the sdma operation. Please verify it.
>
> Thanks
>
> Xiaogang
>
> *From:*Chen, Xiaogang
> *Sent:* Wednesday, June 3, 2026 5:51 PM
> *To:* Gerhard Schwanzer <[email protected]>; [email protected]
> *Cc:* [email protected]; [email protected]; Deucher, 
> Alexander <[email protected]>; Yang, Philip <[email protected]>
> *Subject:* Re: [REGRESSION] drm/amdkfd: SVM split-tail remap 
> regression causes SDMA0 permission fault on RX 7600 XT
>
> Hi Gerhard:
>
> Thanks. I can build the app now, and I see the regression. I am
> triaging it.
>
> The purpose of this patch is to remap split svm ranges (head/tail) that
> were mapped with huge page mappings (pmd) but can no longer be mapped
> with huge pages after the split because the new svm ranges are not 2MB
> aligned. It seems the remap decision misses the case where both the
> head and tail ranges come from an original range that used huge page
> mappings. Will check....
>
> Regards
>
> Xiaogang
>
> On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
>
>     Hi Xiaogang,
>
>     Sorry, you are right. The source I uploaded was not self-contained, it
>     still referenced trace_history_replay.inc from an older local replay
>     mode.
>
>     I uploaded a self-contained v2 source to the GitLab report:
>
>     https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
>
>     The --upstream-ab path does not use that replay table, but the missing
>     include obviously broke fresh builds. The v2 source embeds the table and
>     otherwise preserves the same source.
>
>     I re-tested this v2 source before uploading:
>
>         - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
>         - ./kfd_svm_split_hsa_copy --help: OK
>         - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
>           GCVM/SDMA0/protection-fault messages in the test window
>         - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
>           the first kernel fault address matched the planned split-tail page
>
>     Validation summaries:
>
>     https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
>
>     https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
>
>     The intended triage command remains:
>
>         ./kfd_svm_split_hsa_copy --upstream-ab
>
>     Generic build shape is:
>
>         cc -O2 -g -Wall -Wextra -pthread \
>           -I/path/to/rocm/include -L/path/to/rocm/lib \
>           -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
>           -lhsa-runtime64
>
>     If you still prefer a binary, please tell me the target runtime/distro. A
>     binary built on my NixOS system is Nix-store linked and likely not
>     portable to your test system.
>
>     One more thing that would help me test any replacement fix: do you know
>     what specific failure or workload 448ee453 was intended to fix? I would
>     like to avoid validating only the revert side while accidentally losing
>     the original fix.
>
>     Thanks for catching this, and thanks for taking a look.
>
>     Regards,
>
>     Gerhard
>
>     On 06/03/2026 Chen, Xiaogang wrote:
>
>         I cannot compile kfd_svm_split_hsa_copy.c, there is no
>         "trace_history_replay.inc".
>
>         Or can you send the test binary? That should be enough to triage the
>         issue since it is a regression as you mentioned.
>
>         Regards
>
>         Xiaogang
>
>         On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>
>             Hi,
>
>             I would like to make sure this AMDKFD SVM regression is tracked by
>             the Linux regression process.
>
>             GitLab report:
>
>                 https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>
>             The regression was originally reported on 2026-01-27. It was
>             bisected to the same functional change that Alex Deucher's revert
>             patch later targeted:
>
>                 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>
>                 drm/amdkfd: Use huge page size to check split svm range alignment
>
>             The affected kernel line I tested identifies the same change as:
>
>                 bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>
>             Alex's revert patch:
>
>             https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>
>             A small C/HSA reproducer is now available in the GitLab report. It
>             does not require PyTorch, ComfyUI, Docker, model files, or the
>             original workload. It uses ROCr/HSA, an anonymous THP-advised host
>             mapping, explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>
>             Single reproducer command, same binary on both kernels:
>
>                 ./kfd_svm_split_hsa_copy --upstream-ab
>
>             Same-machine A/B result on an RX 7600 XT:
>
>                 448ee453/bf2084a7 active:
>
>                   1/1 run faults with SDMA0 permission fault
>
>                   GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>
>                 448ee453/bf2084a7 locally reverted:
>
>                   10/10 runs complete
>
>                   no ROCr memory access fault
>
>                   no new GCVM/SDMA0 permission fault in dmesg
>
>             The bad fault page is inside the split tail and inside the SDMA
>             copy range:
>
>                 critical tail: [0x722429d61..0x722429dff]
>
>                 copy pages:    [0x722429b30..0x722429d70]
>
>                 fault page:    0x722429d65
>
>             A full ftrace/PTE run with the same C reproducer/SVM sequence also
>             shows:
>
>                 split_tail ... current_remap=0 old_remap=1 missed=1
>
>                 MISSED_REMAP_CANDIDATE split=tail
>
>                 no amdgpu_vm_update_ptes covering the fault page after the
>                 marker before the fault-side GET_ATTR
>
>             The suspected code issue is that the split-tail/head remap
>             predicate introduced by 448ee453/bf2084a7 can miss tails inside the
>             final 512-page block. Since prange->last is inclusive,
>             ALIGN_DOWN(prange->last, 512) is the start of the final block, not
>             an exclusive upper bound.
>
>             I also sent a short follow-up to amd-gfx with the reproducer/A-B
>             summary and asked what original failure or workload
>             448ee453/bf2084a7 was intended to fix:
>
>             https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>
>             I can resend the reproducer source and summaries directly on-list
>             if preferred.
>
>             #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>
>             #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>
>             Thanks,
>
>             Gerhard Schwanzer
>
