On Fri, May 29, 2026 at 4:29 PM <[email protected]> wrote:
>
> From: Vitaly Prosyak <[email protected]>
>
> Problem:
> While developing the amd_close_race IGT test (which intentionally triggers
> execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
> entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
> zero diagnostic output. The GPU simply hangs silently for ~10s until the
> scheduler timeout fires. There is no way to distinguish an execute
> permission fault from any other type of GPU hang.
>
> Root cause:
> GFX 10.1.x defaults to noretry=0, which sets
> RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
> (gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
> wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
> loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
> translation indefinitely, expecting software to eventually fix the
> permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
> reaches the IH ring.
>
> This is different from invalid-page faults (V=0) which DO generate a retry
> fault interrupt that the driver can escalate to a no-retry fault. Permission
> faults with valid PTEs loop silently forever in hardware.
>
> GFX 10.3+ already defaults to noretry=1, which makes permission faults
> generate immediate L2 protection fault interrupts. GFX 10.1.x was
> inadvertently left out of this default.
>
> Fix:
> Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
> IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
> change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
> generations.
>
> With noretry=1, the existing non-retry fault handler
> (gmc_v10_0_process_interrupt) already decodes and prints the full
> GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS,
> faulting address, VMID, PASID, and process name. No additional logging
> code is needed; the fix is purely routing permission faults to the
> existing, fully-capable non-retry interrupt handler.
>
> v2: Dropped GFX10-specific logging from gmc_v10_0.c and
> kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
> fault handler, but with noretry=1 permission faults take the non-retry
> path, so the v1 retry handler code was dead and would never execute.
>
> Tested on Navi10 (GFX 10.1.10):
> - Execute permission faults now produce immediate, clear output:
>     [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
>     Process amd_close_race pid 13380 thread amd_close_race pid 13384
>     in page at address 0x40001000 from client 0x1b (UTCL2)
>     GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
>     PERMISSION_FAULTS: 0x8
> - No regressions with properly-mapped GPU workloads
>
> Cc: Christian Koenig <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Cc: Felix Kuehling <[email protected]>
> Signed-off-by: Vitaly Prosyak <[email protected]>
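
For readers checking the quoted test output by hand, here is a minimal sketch of how the PERMISSION_FAULTS value relates to the raw status word. It assumes the field sits in bits 7:4 of GCVM_L2_PROTECTION_FAULT_STATUS, which is consistent with the log above (0x00700881 gives 0x8) but should be confirmed against the GC 10.1 register headers. The macro and function names are hypothetical, not the driver's.

```c
#include <stdint.h>

/*
 * Assumed field position: PERMISSION_FAULTS in bits 7:4 of the status
 * register. This matches the quoted log (0x00700881 -> 0x8) but is not
 * taken from the register headers.
 */
#define FAULT_STATUS_PERMISSION_FAULTS_SHIFT 4
#define FAULT_STATUS_PERMISSION_FAULTS_MASK  0x000000F0u

/* Hypothetical helper, not a kernel function. */
static inline uint32_t fault_status_permission_faults(uint32_t status)
{
	/* Isolate the field, then shift it down to bit 0. */
	return (status & FAULT_STATUS_PERMISSION_FAULTS_MASK) >>
	       FAULT_STATUS_PERMISSION_FAULTS_SHIFT;
}
```

Calling fault_status_permission_faults(0x00700881) under that assumption returns 0x8, the value printed in the log.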
Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 13bec8461cde..a9bb01c6cb58 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>                                 gc_ver == IP_VERSION(9, 4, 3) ||
>                                 gc_ver == IP_VERSION(9, 4, 4) ||
>                                 gc_ver == IP_VERSION(9, 5, 0) ||
> -                               gc_ver >= IP_VERSION(10, 3, 0));
> +                               gc_ver >= IP_VERSION(10, 1, 0));
>
>         /* For GFX12.1 B0, set xnack (retry) on as default */
>         if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
> --
> 2.54.0
>
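
To illustrate what the threshold change does for Navi10, here is a rough standalone sketch of the comparison. It models only the part of the condition visible in the hunk above; the IP_VERSION() packing is a simplified stand-in for the kernel macro (any packing that orders by major, minor, revision behaves the same here), and noretry_default() is a hypothetical helper rather than the driver function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for the kernel's IP_VERSION() macro. Assumption:
 * versions compare in (major, minor, revision) order.
 */
#define IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/*
 * Hypothetical helper: covers only the checks shown in the quoted hunk,
 * with the ">=" threshold passed in so both variants can be compared.
 */
static bool noretry_default(uint32_t gc_ver, uint32_t threshold)
{
	return gc_ver == IP_VERSION(9, 4, 3) ||
	       gc_ver == IP_VERSION(9, 4, 4) ||
	       gc_ver == IP_VERSION(9, 5, 0) ||
	       gc_ver >= threshold;
}
```

With the old threshold, noretry_default(IP_VERSION(10, 1, 10), IP_VERSION(10, 3, 0)) is false, so Navi10 keeps retry enabled. With the new threshold of IP_VERSION(10, 1, 0) it becomes true, while the result for GFX 10.3 and later is unchanged.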
