Hi Xiaogang,

Thanks. I tested your attached patch on my RX 7600 XT system.

Test setup:
- kernel 7.0.11 with 448ee453/bf2084a7 active
- local revert not applied
- your attached candidate fix applied
- same self-contained v2 reproducer source as before, unchanged
  sha256: 33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27

Command:

  ./kfd_svm_split_hsa_copy --upstream-ab

Result:
- 10/10 runs completed successfully
- all HSA/SDMA D2H copies completed
- no ROCr memory access fault
- no new GCVM_L2_PROTECTION_FAULT_STATUS
- no SDMA0 permission fault
- no GPU page fault in the kernel log

So your patch fixes the reproducer on my system with the original
reproducer unchanged.

Please feel free to add:

Tested-by: Gerhard Schwanzer [email protected]

Thanks,
Gerhard

On 05/06/26 at 19:59, Chen, Xiaogang wrote:
> Hi Gerhard:
>
> I think the cause is checking the last byte address of svm range for
> 2MB alignment when decide possible huge page mapping. Your test case
> has vm range that ends just one byte before alignment.
>
> I tested your app with the attachment, no page fault during sdma
> operation. Please verify it.
>
> Thanks
>
> Xiaogang
>
> *From:* Chen, Xiaogang
> *Sent:* Wednesday, June 3, 2026 5:51 PM
> *To:* Gerhard Schwanzer <[email protected]>; [email protected]
> *Cc:* [email protected]; [email protected]; Deucher,
> Alexander <[email protected]>; Yang, Philip <[email protected]>
> *Subject:* Re: [REGRESSION] drm/amdkfd: SVM split-tail remap
> regression causes SDMA0 permission fault on RX 7600 XT
>
> Hi Gerhard:
>
> Thanks. I can build the app now. And I saw the regression. I am
> triaging it.
>
> The purpose of this patch is to remap split svm ranges(head/tail) that
> were mapped with huge page mapping(pmd), but cannot be mapped in huge
> page mapping after split due to new svm ranges are not 2MB aligned. It
> seems the remap decision misses case that both head and tail ranges
> are from original range with huge page mappings were used. Will check....
>
> Regards
>
> Xiaogang
>
> On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
>
> Hi Xiaogang,
>
> Sorry, you are right. The source I uploaded was not self-contained, it
> still referenced trace_history_replay.inc from an older local replay
> mode.
>
> I uploaded a self-contained v2 source to the GitLab report:
>
> https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
>
> The --upstream-ab path does not use that replay table, but the missing
> include obviously broke fresh builds.
> The v2 source embeds the table and otherwise preserves the same source.
>
> I re-tested this v2 source before uploading:
>
> - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
> - ./kfd_svm_split_hsa_copy --help: OK
> - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
>   GCVM/SDMA0/protection-fault messages in the test window
> - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
>   the first kernel fault address matched the planned split-tail page
>
> Validation summaries:
>
> https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
> https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
>
> The intended triage command remains:
>
>   ./kfd_svm_split_hsa_copy --upstream-ab
>
> Generic build shape is:
>
>   cc -O2 -g -Wall -Wextra -pthread \
>     -I/path/to/rocm/include -L/path/to/rocm/lib \
>     -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
>     -lhsa-runtime64
>
> If you still prefer a binary, please tell me the target runtime/distro.
> A binary built on my NixOS system is Nix-store linked and likely not
> portable to your test system.
>
> One more thing that would help me test any replacement fix: do you know
> what specific failure or workload 448ee453 was intended to fix? I would
> like to avoid validating only the revert side while accidentally losing
> the original fix.
>
> Thanks for catching this, and thanks for taking a look.
>
> Regards,
> Gerhard
>
> On 06/03/2026 Chen, Xiaogang wrote:
>
> I cannot compile kfd_svm_split_hsa_copy.c, there is no
> "trace_history_replay.inc".
>
> Or can you send the test binary? That should be enough to triage the
> issue since it is a regression as you mentioned.
>
> Regards
>
> Xiaogang
>
> On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>
> Hi,
>
> I would like to make sure this AMDKFD SVM regression is tracked by the
> Linux regression process.
>
> GitLab report:
>
> https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>
> The regression was originally reported on 2026-01-27. It was bisected
> to the same functional change that Alex Deucher's revert patch later
> targeted:
>
>   448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>   drm/amdkfd: Use huge page size to check split svm range alignment
>
> The affected kernel line I tested identifies the same change as:
>
>   bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>
> Alex's revert patch:
>
> https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>
> A small C/HSA reproducer is now available in the GitLab report. It does
> not require PyTorch, ComfyUI, Docker, model files, or the original
> workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
> explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>
> Single reproducer command, same binary on both kernels:
>
>   ./kfd_svm_split_hsa_copy --upstream-ab
>
> Same-machine A/B result on an RX 7600 XT:
>
>   448ee453/bf2084a7 active:
>     1/1 run faults with SDMA0 permission fault
>     GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>
>   448ee453/bf2084a7 locally reverted:
>     10/10 runs complete
>     no ROCr memory access fault
>     no new GCVM/SDMA0 permission fault in dmesg
>
> The bad fault page is inside the split tail and inside the SDMA copy
> range:
>
>   critical tail: [0x722429d61..0x722429dff]
>   copy pages:    [0x722429b30..0x722429d70]
>   fault page:    0x722429d65
>
> A full ftrace/PTE run with the same C reproducer/SVM sequence also
> shows:
>
>   split_tail ... current_remap=0 old_remap=1 missed=1
>   MISSED_REMAP_CANDIDATE split=tail
>   no amdgpu_vm_update_ptes covering the fault page after the marker
>   before the fault-side GET_ATTR
>
> The suspected code issue is that the split-tail/head remap predicate
> introduced by 448ee453/bf2084a7 can miss tails inside the final
> 512-page block. Since prange->last is inclusive,
> ALIGN_DOWN(prange->last, 512) is the start of the final block, not an
> exclusive upper bound.
>
> I also sent a short follow-up to amd-gfx with the reproducer/A-B
> summary and asked what original failure or workload 448ee453/bf2084a7
> was intended to fix:
>
> https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>
> I can resend the reproducer source and summaries directly on-list if
> preferred.
>
> #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
> #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>
> Thanks,
> Gerhard Schwanzer
