On Fri, May 29, 2026 at 4:29 PM <[email protected]> wrote:
>
> From: Vitaly Prosyak <[email protected]>
>
> Problem:
> While developing the amd_close_race IGT test (which intentionally triggers
> execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
> entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
> zero diagnostic output. The GPU simply hangs silently for ~10s until the
> scheduler timeout fires. There is no way to distinguish an execute
> permission fault from any other type of GPU hang.
>
> Root cause:
> GFX 10.1.x defaults to noretry=0, which sets
> RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
> (gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
> wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
> loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
> translation indefinitely, expecting software to eventually fix the
> permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
> reaches the IH ring.
>
> This is different from invalid-page faults (V=0) which DO generate a retry
> fault interrupt that the driver can escalate to a no-retry fault. Permission
> faults with valid PTEs loop silently forever in hardware.
>
> GFX 10.3+ already defaults to noretry=1, which makes permission faults
> generate immediate L2 protection fault interrupts. GFX 10.1.x was
> inadvertently left out of this default.
>
> Fix:
> Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
> IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
> change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
> generations.
>
> With noretry=1, the existing non-retry fault handler
> (gmc_v10_0_process_interrupt) already prints the faulting address, VMID,
> PASID, and process name, and decodes the full
> GCVM_L2_PROTECTION_FAULT_STATUS register, including PERMISSION_FAULTS.
> No additional logging code is needed: the fix only routes permission
> faults to the existing non-retry interrupt handler.
>
> v2: Dropped GFX10-specific logging from gmc_v10_0.c and
> kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
> fault handler, but with noretry=1 permission faults take the non-retry
> path, so the v1 logging would never execute.
>
> Tested on Navi10 (GFX 10.1.10):
> - Execute permission faults now produce immediate, clear output:
>     [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
>      Process amd_close_race pid 13380 thread amd_close_race pid 13384
>       in page at address 0x40001000 from client 0x1b (UTCL2)
>     GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
>          PERMISSION_FAULTS: 0x8
> - No regressions with properly-mapped GPU workloads
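
For readers checking the log above, here is a minimal sketch of how the
PERMISSION_FAULTS value can be pulled out of the raw status word. It assumes
the field occupies bits [7:4] of GCVM_L2_PROTECTION_FAULT_STATUS; the macro
and function names are hypothetical, and the real driver uses its own
register field helpers rather than this code.

```c
#include <stdint.h>

/* Assumption: PERMISSION_FAULTS sits in bits [7:4] of the status register. */
#define PERMISSION_FAULTS_SHIFT 4
#define PERMISSION_FAULTS_MASK  0x000000F0u

/* Hypothetical helper: extract the PERMISSION_FAULTS field from a raw
 * GCVM_L2_PROTECTION_FAULT_STATUS value. */
static uint32_t permission_faults(uint32_t status)
{
	return (status & PERMISSION_FAULTS_MASK) >> PERMISSION_FAULTS_SHIFT;
}
```

Under that assumed layout, permission_faults(0x00700881) yields 0x8, which
agrees with the PERMISSION_FAULTS line in the test output.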
>
> Cc: Christian Koenig <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Cc: Felix Kuehling <[email protected]>
> Signed-off-by: Vitaly Prosyak <[email protected]>

Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 13bec8461cde..a9bb01c6cb58 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>                                 gc_ver == IP_VERSION(9, 4, 3) ||
>                                 gc_ver == IP_VERSION(9, 4, 4) ||
>                                 gc_ver == IP_VERSION(9, 5, 0) ||
> -                               gc_ver >= IP_VERSION(10, 3, 0));
> +                               gc_ver >= IP_VERSION(10, 1, 0));
>
>         /* For GFX12.1 B0, set xnack (retry) on as default */
>         if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
> --
> 2.54.0
>
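
To illustrate the effect of moving the threshold, here is a small
self-contained sketch of the version comparison. The IP_VERSION() macro below
is a simplified stand-in (an assumption: the kernel macro may pack additional
fields, but ordering by major, minor, and revision is what matters here), and
noretry_by_threshold() is a hypothetical helper that models only the last
clause of the check in amdgpu_gmc_noretry_set().

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's IP_VERSION() macro (assumption). */
#define IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Hypothetical helper: true when gc_ver falls under the noretry=1 default
 * purely because of the ">= threshold" clause changed by this patch. */
static bool noretry_by_threshold(uint32_t gc_ver, uint32_t threshold)
{
	return gc_ver >= threshold;
}
```

With the old threshold, IP_VERSION(10, 1, 10) (Navi10) compares below
IP_VERSION(10, 3, 0) and keeps retry enabled; with the new threshold of
IP_VERSION(10, 1, 0) it is covered, as are all later versions.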
