Replying to Alex's revert patch: https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
Hi,

The requested small reproducer for drm/amd#4914 is now available:

https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

It is a small C/HSA reproducer. It does not require PyTorch, ComfyUI,
Docker, model files, or the original workload.

Same binary, same command on both kernels:

  ./kfd_svm_split_hsa_copy --upstream-ab

A/B result on the same RX 7600 XT system:

  bf2084a7 active:
    1/1 run faults with SDMA0 permission fault
    GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

  bf2084a7 reverted:
    10/10 runs complete
    no ROCr memory access fault
    no new GCVM/SDMA0 permission fault in dmesg

The bad fault page is inside the split tail and inside the SDMA copy
range:

  critical tail: [0x722429d61..0x722429dff]
  copy pages:    [0x722429b30..0x722429d70]
  fault page:    0x722429d65

A full ftrace/PTE run with the same C reproducer/SVM sequence also
shows:

  split_tail ... current_remap=0 old_remap=1 missed=1
  MISSED_REMAP_CANDIDATE split=tail
  no amdgpu_vm_update_ptes covering the fault page after the marker
    before the fault-side GET_ATTR

One important open question for me is:

  What original failure or workload was 448ee453/bf2084a7 intended to
  fix?

If there is a test case for the original problem, I can check whether a
replacement fix covers both that case and this regression.

Could this revert, or an equivalent fix, be reconsidered?

I can resend the reproducer and summaries directly on-list if
preferred.

Thanks,
Gerhard Schwanzer
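
P.S. For anyone who wants to double-check the range claim above without running the reproducer, here is a minimal sketch of the arithmetic. It is not part of kfd_svm_split_hsa_copy; the struct and helper names are hypothetical, and it assumes the bracketed ranges are inclusive GPU page numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of page numbers [first, last].
 * Treating both bounds as inclusive is an assumption. */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* Values copied from the report above. */
static const struct page_range critical_tail = { 0x722429d61ULL, 0x722429dffULL };
static const struct page_range copy_pages    = { 0x722429b30ULL, 0x722429d70ULL };
static const uint64_t fault_page             = 0x722429d65ULL;

/* True if page lies within r (both bounds inclusive). */
static bool range_contains(struct page_range r, uint64_t page)
{
    return page >= r.first && page <= r.last;
}

/* Stores the overlap of a and b in *out; returns false if they are disjoint. */
static bool range_intersect(struct page_range a, struct page_range b,
                            struct page_range *out)
{
    assert(out != NULL);

    uint64_t first = a.first > b.first ? a.first : b.first;
    uint64_t last  = a.last  < b.last  ? a.last  : b.last;

    if (first > last)
        return false;

    out->first = first;
    out->last  = last;
    return true;
}

/* True if the reported fault page sits in the part of the split tail
 * that the SDMA copy also touches. */
static bool fault_in_overlap(void)
{
    struct page_range overlap;

    if (!range_intersect(critical_tail, copy_pages, &overlap))
        return false;
    return range_contains(overlap, fault_page);
}
```

With the reported values, the overlap of the two ranges is [0x722429d61..0x722429d70], and fault_in_overlap() returns true because 0x722429d65 falls inside it.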
