Hi Gerhard:
I think the cause is that the last byte address of the SVM range is checked for 2MB alignment when deciding whether a huge page mapping is possible. Your test case has a VM range that ends just one byte before the alignment boundary. I tested your app with the attached patch and saw no page fault during the SDMA operation. Please verify it.

Thanks
Xiaogang

From: Chen, Xiaogang
Sent: Wednesday, June 3, 2026 5:51 PM
To: Gerhard Schwanzer <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; Deucher, Alexander <[email protected]>; Yang, Philip <[email protected]>
Subject: Re: [REGRESSION] drm/amdkfd: SVM split-tail remap regression causes SDMA0 permission fault on RX 7600 XT

Hi Gerhard:

Thanks. I can build the app now, and I saw the regression. I am triaging it.

The purpose of this patch is to remap split SVM ranges (head/tail) that were mapped with huge page mappings (PMD) but can no longer be mapped that way after the split, because the new SVM ranges are not 2MB aligned. It seems the remap decision misses the case where both the head and tail ranges come from an original range that used huge page mappings. Will check.

Regards
Xiaogang

On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:

Hi Xiaogang,

Sorry, you are right. The source I uploaded was not self-contained; it still referenced trace_history_replay.inc from an older local replay mode.

I uploaded a self-contained v2 source to the GitLab report:

https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c

The --upstream-ab path does not use that replay table, but the missing include obviously broke fresh builds. The v2 source embeds the table and otherwise preserves the same source.
I re-tested this v2 source before uploading:

- clean build from only kfd_svm_split_hsa_copy-v2.c: OK
- ./kfd_svm_split_hsa_copy --help: OK
- good/workaround kernel: --upstream-ab completed 10/10 runs, no new GCVM/SDMA0/protection-fault messages in the test window
- broken kernel: --upstream-ab reproduced the SDMA0 permission fault; the first kernel fault address matched the planned split-tail page

Validation summaries:

https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md

The intended triage command remains:

    ./kfd_svm_split_hsa_copy --upstream-ab

Generic build shape is:

    cc -O2 -g -Wall -Wextra -pthread \
       -I/path/to/rocm/include -L/path/to/rocm/lib \
       -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
       -lhsa-runtime64

If you still prefer a binary, please tell me the target runtime/distro. A binary built on my NixOS system is Nix-store linked and likely not portable to your test system.

One more thing that would help me test any replacement fix: do you know what specific failure or workload 448ee453 was intended to fix? I would like to avoid validating only the revert side while accidentally losing the original fix.

Thanks for catching this, and thanks for taking a look.

Regards,
Gerhard

On 06/03/2026 Chen, Xiaogang wrote:

I cannot compile kfd_svm_split_hsa_copy.c, there is no "trace_history_replay.inc". Or can you send the test binary? That should be enough to triage the issue since it is a regression as you mentioned.

Regards
Xiaogang

On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:

Hi,

I would like to make sure this AMDKFD SVM regression is tracked by the Linux regression process.

GitLab report:

https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

The regression was originally reported on 2026-01-27.
It was bisected to the same functional change that Alex Deucher's revert patch later targeted:

    448ee45353ef9fb1a34f5f26eb3f48923c6f0898
    drm/amdkfd: Use huge page size to check split svm range alignment

The affected kernel line I tested identifies the same change as:

    bf2084a7b1d75d093b6a79df4c10142d49fbaa0e

Alex's revert patch:

https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html

A small C/HSA reproducer is now available in the GitLab report. It does not require PyTorch, ComfyUI, Docker, model files, or the original workload. It uses ROCr/HSA, an anonymous THP-advised host mapping, explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.

Single reproducer command, same binary on both kernels:

    ./kfd_svm_split_hsa_copy --upstream-ab

Same-machine A/B result on an RX 7600 XT:

448ee453/bf2084a7 active:
    1/1 run faults with SDMA0 permission fault
    GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

448ee453/bf2084a7 locally reverted:
    10/10 runs complete
    no ROCr memory access fault
    no new GCVM/SDMA0 permission fault in dmesg

The bad fault page is inside the split tail and inside the SDMA copy range:

    critical tail: [0x722429d61..0x722429dff]
    copy pages:    [0x722429b30..0x722429d70]
    fault page:    0x722429d65

A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:

    split_tail ... current_remap=0 old_remap=1 missed=1
    MISSED_REMAP_CANDIDATE split=tail
    no amdgpu_vm_update_ptes covering the fault page after the marker before the fault-side GET_ATTR

The suspected code issue is that the split-tail/head remap predicate introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is the start of the final block, not an exclusive upper bound.
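The inclusive-versus-exclusive point can be checked with the page numbers quoted in this report. The following is only a sketch of the arithmetic, not the kernel code: the helper names are hypothetical, the real remap predicate has additional conditions, and it assumes 4KB base pages (512 pages per 2MB block) and that the end of the critical tail is the original range's inclusive last page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2MB huge page / 4KB base page = 512 pages per block. */
#define PAGES_PER_BLOCK 512ULL

/* Round x down to a multiple of a (a must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Hypothetical check: treats align_down(last) as the upper bound. */
static bool tail_seen_with_last(uint64_t tail_start, uint64_t last)
{
    return tail_start < align_down(last, PAGES_PER_BLOCK);
}

/* Hypothetical check: uses the exclusive end (last + 1) instead. */
static bool tail_seen_with_last_plus_one(uint64_t tail_start, uint64_t last)
{
    return tail_start < align_down(last + 1, PAGES_PER_BLOCK);
}

/* Worked check using the page numbers quoted in the report. */
static void check_report_values(void)
{
    const uint64_t tail_start = 0x722429d61ULL; /* start of critical tail */
    const uint64_t last = 0x722429dffULL;       /* inclusive last page */

    /* Inclusive last rounds down to the start of the final block. */
    assert(align_down(last, PAGES_PER_BLOCK) == 0x722429c00ULL);
    /* last + 1 is already 512-aligned, so it is the exclusive bound. */
    assert(align_down(last + 1, PAGES_PER_BLOCK) == 0x722429e00ULL);

    /* The tail starts inside the final block: one bound misses it. */
    assert(!tail_seen_with_last(tail_start, last));
    assert(tail_seen_with_last_plus_one(tail_start, last));
}
```

With these values the tail start 0x722429d61 lies above 0x722429c00 but below 0x722429e00, so a bound derived from the inclusive last page excludes it while a bound derived from last + 1 includes it.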
I also sent a short follow-up to amd-gfx with the reproducer/A-B summary and asked what original failure or workload 448ee453/bf2084a7 was intended to fix:

https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html

I can resend the reproducer source and summaries directly on-list if preferred.

#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

Thanks,
Gerhard Schwanzer
0001-drm-amdkfd-Use-last-1-of-vm-range-to-check-2MB-huge-.patch
