I agree with setting noretry=1 for any GFX10.x.

I don't really understand why execute faults need special handling with noretry=0. If a recoverable fault turns out to be non-recoverable, it should be turned into a no-retry fault, which should result in a page fault message in the kernel log. Is this not happening for execute faults? Why?

Or is the problem you're trying to fix, that you lose information about the nature of the fault? I.e. when we replace the PTE with a no-retry-fault encoding, do we lose information that the original PTE was specifically lacking EXEC permission?

Your extra logging changes are also specific to GFX10. Does this mean the problem is GFX10-specific? If it's not GFX10-specific, and extra logging is really justified, I would expect it to happen for all GFX generations (that support some form of retry faults).

Regards,
  Felix


On 2026-05-28 21:44, [email protected] wrote:
From: Vitaly Prosyak <[email protected]>

Problem
=======
On GFX 10.1.x (Navi10, Navi12, Navi14), execute permission faults are
completely invisible. When a GPU buffer is mapped without VM_PAGE_EXECUTABLE
and the CP attempts to fetch shader instructions from it, the hardware
enters an infinite retry loop with zero diagnostic output:

   - No interrupt is generated
   - No dmesg message appears
   - The CP silently stalls until the scheduler timeout fires (~10s)
   - The only symptom is unexplained GPU job timeouts

This was discovered using the IGT amd_close_race stress test when
VM_PAGE_EXECUTABLE was intentionally removed from IB buffer mappings.
The GPU would hang for ~8 seconds per job with no fault information,
making it impossible to diagnose the root cause from kernel logs alone.

Root Cause
==========
GFX 10.1.x defaults to RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1
(noretry=0). With retry enabled, UTCL1 handles permission faults
locally: it keeps re-requesting the translation from UTCL2 in a
tight loop, hoping the PTE permissions will change. Since they never
do for a genuine execute permission violation, this loops forever.

Crucially, UTCL1 never propagates the fault to the interrupt handler
(IH) ring -- the L2 protection fault interrupt is never generated.
The gmc_v10_0_process_interrupt() handler is simply never called.

GFX 10.3+ already defaults to noretry=1 (set in amdgpu_gmc_noretry_set),
which makes ALL permission faults generate immediate L2 protection fault
interrupts. GFX 10.1.x was the only remaining generation where this
problem existed.

Fix
===
1. Extend the noretry default to include GFX 10.1.x by changing the
    threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in
    amdgpu_gmc_noretry_set(). This aligns Navi10/12/14 behavior with
    all newer GPU generations.

2. Add explicit execute permission fault logging in
    gmc_v10_0_process_interrupt() so that when an execute fault
    arrives (whether via retry or non-retry path), it is clearly
    identified as an execute permission violation rather than a
    generic page fault.

3. Add execute permission fault detection in the KFD interrupt
    handler (kfd_int_process_v10.c) to extract and log the EXE bit
    from the IH ring entry source data.

With noretry=1, the fault path becomes:
   CP fetch -> UTCL1 miss -> UTCL2 lookup -> PTE found but no X bit ->
   L2 protection fault interrupt -> IH ring -> gmc_v10_0_process_interrupt()

The L2_PROTECTION_FAULT_STATUS register then shows PERMISSION_FAULTS=0x8
(execute bit), and the handler prints the faulting address, process name,
VMID, and PASID.

Test Results (Navi10, IP_VERSION 10.1.10)
=========================================
With amd_close_race test (VM_PAGE_EXECUTABLE intentionally removed):

Before fix:
   - Zero fault messages in dmesg
   - CP stalls for ~8s per job, scheduler timeout kills process
   - No way to identify execute permission as the cause

After fix:
   amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
   amdgpu 0000:03:00.0:  Process amd_close_race pid 13380 thread amd_close_race:13384
   amdgpu 0000:03:00.0:   in page at address 0x0000000040001000 from client 0x1b (UTCL2)
   amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
   amdgpu 0000:03:00.0:      PERMISSION_FAULTS: 0x8
   amdgpu 0000:03:00.0:      MAPPING_ERROR: 0x0
   amdgpu 0000:03:00.0:      RW: 0x0

   - 200 fault interrupts correctly fired during stress test (20 rounds)
   - PERMISSION_FAULTS: 0x8 = execute permission violation
   - Full process identification available
   - No regressions with normal (properly-mapped) GPU workloads

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       |  2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        | 23 +++++++++++++++++--
  .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  |  9 ++++++++
  3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..a9bb01c6cb58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
                                gc_ver == IP_VERSION(9, 4, 3) ||
                                gc_ver == IP_VERSION(9, 4, 4) ||
                                gc_ver == IP_VERSION(9, 5, 0) ||
-                               gc_ver >= IP_VERSION(10, 3, 0));
+                               gc_ver >= IP_VERSION(10, 1, 0));
        /* For GFX12.1 B0, set xnack (retry) on as default */
        if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 8523833a74fb..554f514e59f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -102,6 +102,8 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
  {
        uint32_t vmhub_index = entry->client_id == SOC15_IH_CLIENTID_VMC ?
                               AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0);
+       bool exe_fault = !!(entry->src_data[1] &
+                           AMDGPU_GMC9_FAULT_SOURCE_DATA_EXE);
        struct amdgpu_vmhub *hub = &adev->vmhub[vmhub_index];
        bool retry_fault = !!(entry->src_data[1] &
                              AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY);
@@ -117,9 +119,26 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
        if (retry_fault) {
                int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
                                                        write_fault);
-               /* Returning 1 here also prevents sending the IV to the KFD */
-               if (ret == 1)
+               /*
+                * For execute permission faults, always fall through to
+                * print the fault info. This makes missing VM_PAGE_EXECUTABLE
+                * mappings visible in dmesg instead of silently stalling
+                * the CP in an infinite retry loop.
+                */
+               if (ret == 1 && exe_fault) {
+                       dev_err_ratelimited(adev->dev,
+                               "[%s] execute permission retry fault "
+                               "(src_id:%u ring:%u vmid:%u pasid:%u "
+                               "addr:0x%016llx flags:0x%02x)\n",
+                               entry->vmid_src ? "mmhub" : "gfxhub",
+                               entry->src_id, entry->ring_id,
+                               entry->vmid, entry->pasid, addr,
+                               (unsigned int)(entry->src_data[1] & 0xff));
+                       /* Fall through to print L2 protection fault status */
+               } else if (ret == 1) {
+                       /* Returning 1 prevents sending the IV to the KFD */
                        return 1;
+               }
        }
        if (!amdgpu_sriov_vf(adev)) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
index 19406ab92c5b..800592bc908c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
@@ -360,6 +360,15 @@ static void event_interrupt_wq_v10(struct kfd_node *dev,
                info.prot_valid = ring_id & 0x08;
                info.prot_read  = ring_id & 0x10;
                info.prot_write = ring_id & 0x20;
+               info.prot_exec  = ih_ring_entry[5] & 0x10;
+
+               if (info.prot_exec)
+                       dev_info_ratelimited(dev->adev->dev,
+                               "KFD: execute permission fault "
+                               "(vmid:%u pasid:%u addr:0x%llx src_data1:0x%x)\n",
+                               vmid, pasid,
+                               (uint64_t)info.page_addr << PAGE_SHIFT,
+                               le32_to_cpu(ih_ring_entry[5]));
                memset(&exception_data, 0, sizeof(exception_data));
                exception_data.gpu_id = dev->id;
