From: Vitaly Prosyak <[email protected]>

Problem
=======
On GFX 10.1.x (Navi10, Navi12, Navi14), execute permission faults are
completely invisible. When a GPU buffer is mapped without VM_PAGE_EXECUTABLE
and the CP attempts to fetch shader instructions from it, the hardware
enters an infinite retry loop with zero diagnostic output:

  - No interrupt is generated
  - No dmesg message appears
  - The CP silently stalls until the scheduler timeout fires (~10s)
  - The only symptom is unexplained GPU job timeouts

This was discovered using the IGT amd_close_race stress test when
VM_PAGE_EXECUTABLE was intentionally removed from IB buffer mappings.
The GPU would hang for ~8 seconds per job with no fault information,
making it impossible to diagnose the root cause from kernel logs alone.

Root Cause
==========
GFX 10.1.x defaults to RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1
(noretry=0). With retry enabled, UTCL1 handles permission faults
locally: it keeps re-requesting the translation from UTCL2 in a
tight loop, hoping the PTE permissions will change. Since they never
do for a genuine execute permission violation, this loops forever.

Crucially, UTCL1 never propagates the fault to the interrupt handler
(IH) ring -- the L2 protection fault interrupt is never generated.
The gmc_v10_0_process_interrupt() handler is simply never called.

GFX 10.3+ already defaults to noretry=1 (set in amdgpu_gmc_noretry_set()),
which makes ALL permission faults generate immediate L2 protection fault
interrupts. GFX 10.1.x was the only remaining generation where this
problem existed.

Fix
===
1. Extend the noretry default to include GFX 10.1.x by changing the
   threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in
   amdgpu_gmc_noretry_set(). This aligns Navi10/12/14 behavior with
   all newer GPU generations.

2. Add explicit execute permission fault logging in
   gmc_v10_0_process_interrupt(): when a retry fault with the EXE
   bit set is handled, log it as an execute permission violation
   and fall through to the L2 protection fault status print
   instead of returning early. Non-retry faults continue to use
   the existing fault print.

3. Add execute permission fault detection in the KFD interrupt
   handler (kfd_int_process_v10.c) to extract and log the EXE bit
   from the IH ring entry source data.
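
The threshold change in step 1 can be illustrated with a small
standalone sketch. EX_IP_VERSION() and the ex_noretry_default_*()
helpers are hypothetical stand-ins, not kernel code: the version
packing is a simplified assumption, and only the final ">=" clause
of the check is modelled (the GFX9 "==" cases are left out).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the kernel's IP_VERSION()
 * packing (assumption): major, minor and revision are packed so that
 * plain integer comparison orders versions correctly.
 */
#define EX_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Old threshold: only GC 10.3.0 and newer default to noretry=1. */
static bool ex_noretry_default_old(uint32_t gc_ver)
{
	return gc_ver >= EX_IP_VERSION(10, 3, 0);
}

/* New threshold: GC 10.1.0 and newer default to noretry=1. */
static bool ex_noretry_default_new(uint32_t gc_ver)
{
	return gc_ver >= EX_IP_VERSION(10, 1, 0);
}
```

Under these assumptions, Navi10 (10.1.10) is excluded by the old
threshold and included by the new one, while 10.3.0 is included by
both.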

With noretry=1, the fault path becomes:
  CP fetch -> UTCL1 miss -> UTCL2 lookup -> PTE found but no X bit ->
  L2 protection fault interrupt -> IH ring -> gmc_v10_0_process_interrupt()

The L2_PROTECTION_FAULT_STATUS register then shows PERMISSION_FAULTS=0x8
(execute bit), and the handler prints the faulting address, process name,
VMID, and PASID.
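
As a rough illustration of how the status value maps to the fields
printed by the handler, the sketch below decodes a raw register
value. The EX_* masks and shifts are assumptions (PERMISSION_FAULTS
in bits 7:4, MAPPING_ERROR in bit 8, RW in bit 18) chosen to be
consistent with the log output in the test results; the helper names
are hypothetical and the register headers remain the authority on
the layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout of the L2 protection fault status register. */
#define EX_PERMISSION_FAULTS_MASK   0x000000f0u
#define EX_PERMISSION_FAULTS_SHIFT  4
#define EX_MAPPING_ERROR_MASK       0x00000100u
#define EX_MAPPING_ERROR_SHIFT      8
#define EX_RW_MASK                  0x00040000u
#define EX_RW_SHIFT                 18

/* Per the commit message, PERMISSION_FAULTS=0x8 is the execute bit. */
#define EX_PERM_FAULT_EXECUTE       0x8u

static uint32_t ex_permission_faults(uint32_t status)
{
	return (status & EX_PERMISSION_FAULTS_MASK) >> EX_PERMISSION_FAULTS_SHIFT;
}

static uint32_t ex_mapping_error(uint32_t status)
{
	return (status & EX_MAPPING_ERROR_MASK) >> EX_MAPPING_ERROR_SHIFT;
}

static uint32_t ex_rw(uint32_t status)
{
	return (status & EX_RW_MASK) >> EX_RW_SHIFT;
}

/* True when the execute bit is set in the PERMISSION_FAULTS field. */
static bool ex_is_execute_fault(uint32_t status)
{
	return (ex_permission_faults(status) & EX_PERM_FAULT_EXECUTE) != 0;
}
```

Applied to the value 0x00700881 from the test log, these helpers
yield PERMISSION_FAULTS 0x8, MAPPING_ERROR 0x0 and RW 0x0, matching
the three fields printed by the handler.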

Test Results (Navi10, IP_VERSION 10.1.10)
=========================================
With amd_close_race test (VM_PAGE_EXECUTABLE intentionally removed):

Before fix:
  - Zero fault messages in dmesg
  - CP stalls for ~8s per job, scheduler timeout kills process
  - No way to identify execute permission as the cause

After fix:
  amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
  amdgpu 0000:03:00.0:  Process amd_close_race pid 13380 thread amd_close_race:13384
  amdgpu 0000:03:00.0:   in page at address 0x0000000040001000 from client 0x1b (UTCL2)
  amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
  amdgpu 0000:03:00.0:      PERMISSION_FAULTS: 0x8
  amdgpu 0000:03:00.0:      MAPPING_ERROR: 0x0
  amdgpu 0000:03:00.0:      RW: 0x0

  - 200 fault interrupts correctly fired during stress test (20 rounds)
  - PERMISSION_FAULTS: 0x8 = execute permission violation
  - Full process identification available
  - No regressions with normal (properly-mapped) GPU workloads

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        | 23 +++++++++++++++++--
 .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  |  9 ++++++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..a9bb01c6cb58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
                                gc_ver == IP_VERSION(9, 4, 3) ||
                                gc_ver == IP_VERSION(9, 4, 4) ||
                                gc_ver == IP_VERSION(9, 5, 0) ||
-                               gc_ver >= IP_VERSION(10, 3, 0));
+                               gc_ver >= IP_VERSION(10, 1, 0));
 
        /* For GFX12.1 B0, set xnack (retry) on as default */
        if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 8523833a74fb..554f514e59f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -102,6 +102,8 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
 {
        uint32_t vmhub_index = entry->client_id == SOC15_IH_CLIENTID_VMC ?
                               AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0);
+       bool exe_fault = !!(entry->src_data[1] &
+                           AMDGPU_GMC9_FAULT_SOURCE_DATA_EXE);
        struct amdgpu_vmhub *hub = &adev->vmhub[vmhub_index];
        bool retry_fault = !!(entry->src_data[1] &
                              AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY);
@@ -117,9 +119,26 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
        if (retry_fault) {
                int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
                                                        write_fault);
-               /* Returning 1 here also prevents sending the IV to the KFD */
-               if (ret == 1)
+               /*
+                * For execute permission faults, always fall through to
+                * print the fault info. This makes missing VM_PAGE_EXECUTABLE
+                * mappings visible in dmesg instead of silently stalling
+                * the CP in an infinite retry loop.
+                */
+               if (ret == 1 && exe_fault) {
+                       dev_err_ratelimited(adev->dev,
+                               "[%s] execute permission retry fault "
+                               "(src_id:%u ring:%u vmid:%u pasid:%u "
+                               "addr:0x%016llx flags:0x%02x)\n",
+                               entry->vmid_src ? "mmhub" : "gfxhub",
+                               entry->src_id, entry->ring_id,
+                               entry->vmid, entry->pasid, addr,
+                               (unsigned int)(entry->src_data[1] & 0xff));
+                       /* Fall through to print L2 protection fault status */
+               } else if (ret == 1) {
+                       /* Returning 1 prevents sending the IV to the KFD */
                        return 1;
+               }
        }
 
        if (!amdgpu_sriov_vf(adev)) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
index 19406ab92c5b..800592bc908c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
@@ -360,6 +360,15 @@ static void event_interrupt_wq_v10(struct kfd_node *dev,
                info.prot_valid = ring_id & 0x08;
                info.prot_read  = ring_id & 0x10;
                info.prot_write = ring_id & 0x20;
+               info.prot_exec  = ih_ring_entry[5] & 0x10;
+
+               if (info.prot_exec)
+                       dev_info_ratelimited(dev->adev->dev,
+                               "KFD: execute permission fault "
+                               "(vmid:%u pasid:%u addr:0x%llx src_data1:0x%x)\n",
+                               vmid, pasid,
+                               (uint64_t)info.page_addr << PAGE_SHIFT,
+                               le32_to_cpu(ih_ring_entry[5]));
 
                memset(&exception_data, 0, sizeof(exception_data));
                exception_data.gpu_id = dev->id;
-- 
2.54.0
