From: Vitaly Prosyak <[email protected]>
Problem
=======
On GFX 10.1.x (Navi10, Navi12, Navi14), execute permission faults are
completely invisible. When a GPU buffer is mapped without VM_PAGE_EXECUTABLE
and the CP attempts to fetch shader instructions from it, the hardware
enters an infinite retry loop with zero diagnostic output:
- No interrupt is generated
- No dmesg message appears
- The CP silently stalls until the scheduler timeout fires (~10s)
- The only symptom is unexplained GPU job timeouts
This was discovered using the IGT amd_close_race stress test when
VM_PAGE_EXECUTABLE was intentionally removed from IB buffer mappings.
The GPU would hang for ~8 seconds per job with no fault information,
making it impossible to diagnose the root cause from kernel logs alone.
Root Cause
==========
GFX 10.1.x defaults to RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1
(noretry=0). With retry enabled, UTCL1 handles permission faults
locally: it keeps re-requesting the translation from UTCL2 in a
tight loop, hoping the PTE permissions will change. Since they never
do for a genuine execute permission violation, this loops forever.
Crucially, UTCL1 never propagates the fault to the interrupt handler
(IH) ring -- the L2 protection fault interrupt is never generated.
The gmc_v10_0_process_interrupt() handler is simply never called.
GFX 10.3+ already defaults to noretry=1 (set in amdgpu_gmc_noretry_set()),
which makes ALL permission faults generate immediate L2 protection fault
interrupts. GFX 10.1.x was the only remaining generation where this
problem existed.
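The version gate described above is an ordered comparison on a packed
version number. A minimal sketch, assuming a simplified three-field packing
(the kernel's real IP_VERSION() macro may pack additional fields) and
hypothetical helper names. It models only the ">=" clause, not the explicit
9.4.x entries or any other logic in amdgpu_gmc_noretry_set():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's IP_VERSION() packing (assumption). */
static uint32_t ip_version(uint32_t major, uint32_t minor, uint32_t rev)
{
	/* Each component must fit in 8 bits for the ordering to hold. */
	assert(major < 256 && minor < 256 && rev < 256);
	return (major << 16) | (minor << 8) | rev;
}

/* Hypothetical helper: true when gc_ver is at or above the threshold. */
static bool noretry_default_by_threshold(uint32_t gc_ver, uint32_t threshold)
{
	return gc_ver >= threshold;
}
```

With a threshold of ip_version(10, 3, 0), Navi10 (10.1.10) falls below the
gate. With ip_version(10, 1, 0), it is included.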
Fix
===
1. Extend the noretry default to include GFX 10.1.x by changing the
threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in
amdgpu_gmc_noretry_set(). This aligns Navi10/12/14 behavior with
all newer GPU generations.
2. Add explicit execute permission fault logging in
gmc_v10_0_process_interrupt() so that when an execute fault
arrives (whether via retry or non-retry path), it is clearly
identified as an execute permission violation rather than a
generic page fault.
3. Add execute permission fault detection in the KFD interrupt
handler (kfd_int_process_v10.c) to extract and log the EXE bit
from the IH ring entry source data.
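The flag extraction used in items 2 and 3 can be sketched in isolation as
follows. The macro names and the RETRY/WRITE values are assumptions made for
illustration. Only EXE = 0x10 is corroborated by the literal used in the KFD
hunk of this patch. The struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments within src_data[1] of the IH ring entry. */
#define FAULT_SOURCE_DATA_RETRY 0x80u
#define FAULT_SOURCE_DATA_WRITE 0x20u
#define FAULT_SOURCE_DATA_EXE   0x10u

/* Hypothetical container for the decoded flags. */
struct fault_flags {
	bool retry;
	bool write;
	bool exe;
};

/* Decode the fault flags from an already CPU-endian src_data[1] value. */
static struct fault_flags decode_src_data1(uint32_t src_data1)
{
	struct fault_flags f = {
		.retry = !!(src_data1 & FAULT_SOURCE_DATA_RETRY),
		.write = !!(src_data1 & FAULT_SOURCE_DATA_WRITE),
		.exe   = !!(src_data1 & FAULT_SOURCE_DATA_EXE),
	};
	return f;
}
```

A value of 0x90 would decode as a retry fault with the execute bit set,
which is the case the new logging in gmc_v10_0_process_interrupt() targets.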
With noretry=1, the fault path becomes:
CP fetch -> UTCL1 miss -> UTCL2 lookup -> PTE found but no X bit ->
L2 protection fault interrupt -> IH ring -> gmc_v10_0_process_interrupt()
The L2_PROTECTION_FAULT_STATUS register then shows PERMISSION_FAULTS=0x8
(execute bit), and the handler prints the faulting address, process name,
VMID, and PASID.
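As an illustration of how the status value is read, the sketch below
extracts the fields reported in the log. The shifts and masks are an assumed
GFX10 layout for GCVM_L2_PROTECTION_FAULT_STATUS and should be checked
against the register headers. The macro and helper names are hypothetical:

```c
#include <stdint.h>

/* Assumed field layout (not taken from this patch). */
#define PERMISSION_FAULTS_SHIFT 4
#define PERMISSION_FAULTS_MASK  0x000000F0u
#define MAPPING_ERROR_SHIFT     8
#define MAPPING_ERROR_MASK      0x00000100u
#define RW_SHIFT                18
#define RW_MASK                 0x00040000u

/* Extract one bit field from a raw status register value. */
static uint32_t get_field(uint32_t status, uint32_t mask, unsigned int shift)
{
	return (status & mask) >> shift;
}
```

Under that assumed layout, a status value of 0x00700881 yields
PERMISSION_FAULTS = 0x8, MAPPING_ERROR = 0x0 and RW = 0x0.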
Test Results (Navi10, IP_VERSION 10.1.10)
=========================================
With amd_close_race test (VM_PAGE_EXECUTABLE intentionally removed):
Before fix:
- Zero fault messages in dmesg
- CP stalls for ~8s per job, scheduler timeout kills process
- No way to identify execute permission as the cause
After fix:
amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
amdgpu 0000:03:00.0: Process amd_close_race pid 13380 thread amd_close_race:13384
amdgpu 0000:03:00.0: in page at address 0x0000000040001000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
amdgpu 0000:03:00.0: PERMISSION_FAULTS: 0x8
amdgpu 0000:03:00.0: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: RW: 0x0
- 200 fault interrupts correctly fired during stress test (20 rounds)
- PERMISSION_FAULTS: 0x8 = execute permission violation
- Full process identification available
- No regressions with normal (properly-mapped) GPU workloads
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 23 +++++++++++++++++--
.../gpu/drm/amd/amdkfd/kfd_int_process_v10.c | 9 ++++++++
3 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..a9bb01c6cb58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
gc_ver == IP_VERSION(9, 4, 3) ||
gc_ver == IP_VERSION(9, 4, 4) ||
gc_ver == IP_VERSION(9, 5, 0) ||
- gc_ver >= IP_VERSION(10, 3, 0));
+ gc_ver >= IP_VERSION(10, 1, 0));
/* For GFX12.1 B0, set xnack (retry) on as default */
if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 8523833a74fb..554f514e59f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -102,6 +102,8 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
{
uint32_t vmhub_index = entry->client_id == SOC15_IH_CLIENTID_VMC ?
AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0);
+ bool exe_fault = !!(entry->src_data[1] &
+ AMDGPU_GMC9_FAULT_SOURCE_DATA_EXE);
struct amdgpu_vmhub *hub = &adev->vmhub[vmhub_index];
bool retry_fault = !!(entry->src_data[1] &
AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY);
@@ -117,9 +119,26 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
if (retry_fault) {
int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
write_fault);
- /* Returning 1 here also prevents sending the IV to the KFD */
- if (ret == 1)
+ /*
+ * For execute permission faults, always fall through to
+ * print the fault info. This makes missing VM_PAGE_EXECUTABLE
+ * mappings visible in dmesg instead of silently stalling
+ * the CP in an infinite retry loop.
+ */
+ if (ret == 1 && exe_fault) {
+ dev_err_ratelimited(adev->dev,
+ "[%s] execute permission retry fault "
+ "(src_id:%u ring:%u vmid:%u pasid:%u "
+ "addr:0x%016llx flags:0x%02x)\n",
+ entry->vmid_src ? "mmhub" : "gfxhub",
+ entry->src_id, entry->ring_id,
+ entry->vmid, entry->pasid, addr,
+ (unsigned int)(entry->src_data[1] & 0xff));
+ /* Fall through to print L2 protection fault status */
+ } else if (ret == 1) {
+ /* Returning 1 prevents sending the IV to the KFD */
return 1;
+ }
}
if (!amdgpu_sriov_vf(adev)) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
index 19406ab92c5b..800592bc908c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
@@ -360,6 +360,15 @@ static void event_interrupt_wq_v10(struct kfd_node *dev,
info.prot_valid = ring_id & 0x08;
info.prot_read = ring_id & 0x10;
info.prot_write = ring_id & 0x20;
+ info.prot_exec = le32_to_cpu(ih_ring_entry[5]) & 0x10;
+
+ if (info.prot_exec)
+ dev_info_ratelimited(dev->adev->dev,
+ "KFD: execute permission fault "
+ "(vmid:%u pasid:%u addr:0x%llx src_data1:0x%x)\n",
+ vmid, pasid,
+ (uint64_t)info.page_addr << PAGE_SHIFT,
+ le32_to_cpu(ih_ring_entry[5]));
memset(&exception_data, 0, sizeof(exception_data));
exception_data.gpu_id = dev->id;