Thank you for testing and confirming.

Xiaogang

On 6/5/2026 1:41 PM, Gerhard Schwanzer wrote:
Hi Xiaogang,

Thanks. I tested your attached patch on my RX 7600 XT system.

Test setup:
- kernel 7.0.11 with 448ee453/bf2084a7 active
- local revert not applied
- your attached candidate fix applied
- same self-contained v2 reproducer source as before, unchanged sha256:
  33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27

Command:
./kfd_svm_split_hsa_copy --upstream-ab

Result:
- 10/10 runs completed successfully
- all HSA/SDMA D2H copies completed
- no ROCr memory access fault
- no new GCVM_L2_PROTECTION_FAULT_STATUS
- no SDMA0 permission fault
- no GPU page fault in the kernel log

So your patch fixes the reproducer on my system with the original
reproducer unchanged.

Please feel free to add:
Tested-by: Gerhard Schwanzer [email protected]

Thanks,
Gerhard


On 05/06/26 at 19:59, Chen, Xiaogang wrote:

Hi Gerhard:

I think the cause is that the last byte address of the svm range is
checked for 2MB alignment when deciding whether a huge page mapping is
possible. Your test case has a vm range that ends just one byte before
the alignment boundary.

I tested your app with the attached patch and saw no page fault during
the sdma operation. Please verify it.

Thanks

Xiaogang

*From:*Chen, Xiaogang
*Sent:* Wednesday, June 3, 2026 5:51 PM
*To:* Gerhard Schwanzer <[email protected]>; [email protected]
*Cc:* [email protected]; [email protected]; Deucher,
Alexander <[email protected]>; Yang, Philip <[email protected]>
*Subject:* Re: [REGRESSION] drm/amdkfd: SVM split-tail remap
regression causes SDMA0 permission fault on RX 7600 XT

Hi Gerhard:

Thanks. I can build the app now, and I can see the regression. I am
triaging it.

The purpose of this patch is to remap split svm ranges (head/tail) that
were mapped with huge page mappings (pmd) but can no longer be mapped
with huge pages after the split because the new svm ranges are not 2MB
aligned. It seems the remap decision misses the case where both the head
and tail ranges come from an original range that used huge page
mappings. Will check....

Regards

Xiaogang

On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:

     Hi Xiaogang,

     Sorry, you are right. The source I uploaded was not self-contained, it still
     referenced trace_history_replay.inc from an older local replay mode.

     I uploaded a self-contained v2 source to the GitLab report:

     https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c

     The --upstream-ab path does not use that replay table, but the missing include
     obviously broke fresh builds. The v2 source embeds the table and otherwise
     preserves the same source.

     I re-tested this v2 source before uploading:

         - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
         - ./kfd_svm_split_hsa_copy --help: OK
         - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
           GCVM/SDMA0/protection-fault messages in the test window
         - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
           the first kernel fault address matched the planned split-tail page

     Validation summaries:

     https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
     https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md

     The intended triage command remains:

         ./kfd_svm_split_hsa_copy --upstream-ab

     Generic build shape is:

         cc -O2 -g -Wall -Wextra -pthread \
           -I/path/to/rocm/include -L/path/to/rocm/lib \
           -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
           -lhsa-runtime64

     If you still prefer a binary, please tell me the target runtime/distro. A
     binary built on my NixOS system is Nix-store linked and likely not portable
     to your test system.

     One more thing that would help me test any replacement fix: do you know what
     specific failure or workload 448ee453 was intended to fix? I would like to
     avoid validating only the revert side while accidentally losing the original
     fix.

     Thanks for catching this, and thanks for taking a look.

     Regards,

     Gerhard

     On 06/03/2026 Chen, Xiaogang wrote:

         I cannot compile kfd_svm_split_hsa_copy.c, there is no
         "trace_history_replay.inc".

         Or can you send the test binary? That should be enough to triage the
         issue since it is a regression as you mentioned.

         Regards

         Xiaogang

         On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:

             Hi,

             I would like to make sure this AMDKFD SVM regression is tracked by the
             Linux regression process.

             GitLab report:

                 https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

             The regression was originally reported on 2026-01-27. It was bisected to
             the same functional change that Alex Deucher's revert patch later targeted:

                 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
                 drm/amdkfd: Use huge page size to check split svm range alignment

             The affected kernel line I tested identifies the same change as:

                 bf2084a7b1d75d093b6a79df4c10142d49fbaa0e

             Alex's revert patch:

                 https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html

             A small C/HSA reproducer is now available in the GitLab report. It does
             not require PyTorch, ComfyUI, Docker, model files, or the original
             workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
             explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.

             Single reproducer command, same binary on both kernels:

                 ./kfd_svm_split_hsa_copy --upstream-ab

             Same-machine A/B result on an RX 7600 XT:

                 448ee453/bf2084a7 active:
                   1/1 run faults with SDMA0 permission fault
                   GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

                 448ee453/bf2084a7 locally reverted:
                   10/10 runs complete
                   no ROCr memory access fault
                   no new GCVM/SDMA0 permission fault in dmesg

             The bad fault page is inside the split tail and inside the SDMA copy
             range:

                 critical tail: [0x722429d61..0x722429dff]
                 copy pages:    [0x722429b30..0x722429d70]
                 fault page:    0x722429d65
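
The containment claim above can be checked mechanically. The following is only an illustrative sketch, not code from the reproducer or the kernel: the `page_range` type and both helper names are hypothetical, and it assumes that both bracketed ranges are inclusive page-number ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: an inclusive range of page numbers [first..last]. */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* True if page lies inside r, with both ends treated as inclusive. */
bool range_contains(struct page_range r, uint64_t page)
{
    assert(r.first <= r.last); /* sanity check on the range itself */
    return r.first <= page && page <= r.last;
}

/* True if page lies inside both ranges at once. */
bool page_in_both(struct page_range a, struct page_range b, uint64_t page)
{
    return range_contains(a, page) && range_contains(b, page);
}
```

With the values quoted above (tail 0x722429d61..0x722429dff, copy 0x722429b30..0x722429d70), `page_in_both` returns true for fault page 0x722429d65, since that page is at or above both range starts and at or below both range ends.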

             A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:

                 split_tail ... current_remap=0 old_remap=1 missed=1
                 MISSED_REMAP_CANDIDATE split=tail
                 no amdgpu_vm_update_ptes covering the fault page after the marker
                 before the fault-side GET_ATTR

             The suspected code issue is that the split-tail/head remap predicate
             introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page
             block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is
             the start of the final block, not an exclusive upper bound.
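
The inclusive-bound arithmetic described above can be illustrated with a small standalone sketch. This is not the kernel code and not the candidate fix from this thread: the helper names and both predicates are hypothetical, 512 pages per huge page assumes 4 KiB base pages and 2 MiB huge pages, and the original range is assumed to end at the tail's last page (0x722429dff).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: 2 MiB huge page / 4 KiB base page = 512 pages. */
#define HUGE_PAGE_NPAGES 512ULL

/* Round x down to a multiple of a; a must be a power of two. */
uint64_t align_down_pages(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & ~(a - 1);
}

/*
 * Hypothetical bound check that aligns the inclusive last page directly.
 * The aligned value is the START of the final 512-page block, so a tail
 * that begins inside that block is never below the bound.
 */
bool tail_below_bound_inclusive_last(uint64_t tail_start, uint64_t last_incl)
{
    return tail_start < align_down_pages(last_incl, HUGE_PAGE_NPAGES);
}

/*
 * Hypothetical variant that converts to an exclusive end (last + 1)
 * before aligning, so a fully covered final block stays under the bound.
 */
bool tail_below_bound_exclusive_end(uint64_t tail_start, uint64_t last_incl)
{
    return tail_start < align_down_pages(last_incl + 1, HUGE_PAGE_NPAGES);
}
```

With tail start 0x722429d61 and last page 0x722429dff, aligning the inclusive last page gives 0x722429c00, so the first check is false and the tail would be skipped. Aligning last + 1 gives 0x722429e00, and the second check is true.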

             I also sent a short follow-up to amd-gfx with the reproducer/A-B summary
             and asked what original failure or workload 448ee453/bf2084a7 was
             intended to fix:

                 https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html

             I can resend the reproducer source and summaries directly on-list if
             preferred.

             #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
             #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

             Thanks,

             Gerhard Schwanzer
