From: Vitaly Prosyak <[email protected]>

Problem
=======

On GFX 10.1.x (Navi10, Navi12, Navi14), execute permission faults are
completely invisible. When a GPU buffer is mapped without
VM_PAGE_EXECUTABLE and the CP attempts to fetch shader instructions from
it, the hardware enters an infinite retry loop with zero diagnostic
output:

- No interrupt is generated
- No dmesg message appears
- The CP silently stalls until the scheduler timeout fires (~10s)
- The only symptom is unexplained GPU job timeouts

This was discovered using the IGT amd_close_race stress test when
VM_PAGE_EXECUTABLE was intentionally removed from IB buffer mappings.
The GPU would hang for ~8 seconds per job with no fault information,
making it impossible to diagnose the root cause from kernel logs alone.

Root Cause
==========

GFX 10.1.x defaults to RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1
(noretry=0). With retry enabled, UTCL1 handles permission faults
locally: it keeps re-requesting the translation from UTCL2 in a tight
loop, hoping the PTE permissions will change. Since they never do for a
genuine execute permission violation, this loops forever.

Crucially, UTCL1 never propagates the fault to the interrupt handler
(IH) ring -- the L2 protection fault interrupt is never generated. The
gmc_v10_0_process_interrupt() handler is simply never called.

GFX 10.3+ already defaults to noretry=1 (set in amdgpu_gmc_noretry_set),
which makes ALL permission faults generate immediate L2 protection fault
interrupts. GFX 10.1.x was the only remaining generation where this
problem existed.

Fix
===

1. Extend the noretry default to include GFX 10.1.x by changing the
   threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in
   amdgpu_gmc_noretry_set(). This aligns Navi10/12/14 behavior with all
   newer GPU generations.

2. Add explicit execute permission fault logging in
   gmc_v10_0_process_interrupt() so that when an execute fault arrives
   (whether via retry or non-retry path), it is clearly identified as an
   execute permission violation rather than a generic page fault.

3. Add execute permission fault detection in the KFD interrupt handler
   (kfd_int_process_v10.c) to extract and log the EXE bit from the IH
   ring entry source data.
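
As a rough illustration of item 1, the effect of lowering the threshold
can be sketched outside the kernel. Everything prefixed SKETCH_/sketch_
is hypothetical: the macro is a simplified stand-in for the kernel's
IP_VERSION() (assumed only to pack major/minor/revision so that integer
comparison orders versions), and the explicitly listed GFX 9.x versions
in the real condition are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for IP_VERSION() (assumption, not the kernel
 * definition): pack major, minor and revision so that a plain integer
 * comparison orders versions correctly.
 */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Threshold term before the patch: noretry from GFX 10.3.0 upwards. */
static inline bool sketch_noretry_old(uint32_t gc_ver)
{
	return gc_ver >= SKETCH_IP_VERSION(10, 3, 0);
}

/* Threshold term after the patch: noretry from GFX 10.1.0 upwards. */
static inline bool sketch_noretry_new(uint32_t gc_ver)
{
	return gc_ver >= SKETCH_IP_VERSION(10, 1, 0);
}
```

Under these assumptions, Navi10 (10.1.10) falls below the old threshold
and above the new one, while 10.3.0 and later are unaffected.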

With noretry=1, the fault path becomes:

  CP fetch -> UTCL1 miss -> UTCL2 lookup -> PTE found but no X bit
    -> L2 protection fault interrupt -> IH ring
    -> gmc_v10_0_process_interrupt()

The L2_PROTECTION_FAULT_STATUS register then shows PERMISSION_FAULTS=0x8
(execute bit), and the handler prints the faulting address, process
name, VMID, and PASID.

Test Results (Navi10, IP_VERSION 10.1.10)
=========================================

With amd_close_race test (VM_PAGE_EXECUTABLE intentionally removed):

Before fix:
- Zero fault messages in dmesg
- CP stalls for ~8s per job, scheduler timeout kills process
- No way to identify execute permission as the cause

After fix:

  amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
  amdgpu 0000:03:00.0: Process amd_close_race pid 13380 thread amd_close_race:13384
  amdgpu 0000:03:00.0: in page at address 0x0000000040001000 from client 0x1b (UTCL2)
  amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
  amdgpu 0000:03:00.0: PERMISSION_FAULTS: 0x8
  amdgpu 0000:03:00.0: MAPPING_ERROR: 0x0
  amdgpu 0000:03:00.0: RW: 0x0

- 200 fault interrupts correctly fired during stress test (20 rounds)
- PERMISSION_FAULTS: 0x8 = execute permission violation
- Full process identification available
- No regressions with normal (properly-mapped) GPU workloads

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Felix Kuehling <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        | 23 +++++++++++++++++--
 .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  |  9 ++++++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..a9bb01c6cb58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -1014,7 +1014,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
 				    gc_ver == IP_VERSION(9, 4, 3) ||
 				    gc_ver == IP_VERSION(9, 4, 4) ||
 				    gc_ver == IP_VERSION(9, 5, 0) ||
-				    gc_ver >= IP_VERSION(10, 3, 0));
+				    gc_ver >= IP_VERSION(10, 1, 0));
 
 	/* For GFX12.1 B0, set xnack (retry) on as default */
 	if (gc_ver == IP_VERSION(12, 1, 0) && (adev->rev_id & 0xf) == 0x1)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 8523833a74fb..554f514e59f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -102,6 +102,8 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
 {
 	uint32_t vmhub_index = entry->client_id == SOC15_IH_CLIENTID_VMC ?
 			       AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0);
+	bool exe_fault = !!(entry->src_data[1] &
+			    AMDGPU_GMC9_FAULT_SOURCE_DATA_EXE);
 	struct amdgpu_vmhub *hub = &adev->vmhub[vmhub_index];
 	bool retry_fault = !!(entry->src_data[1] &
 			      AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY);
@@ -117,9 +119,26 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
 	if (retry_fault) {
 		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
 							write_fault);
-		/* Returning 1 here also prevents sending the IV to the KFD */
-		if (ret == 1)
+		/*
+		 * For execute permission faults, always fall through to
+		 * print the fault info. This makes missing VM_PAGE_EXECUTABLE
+		 * mappings visible in dmesg instead of silently stalling
+		 * the CP in an infinite retry loop.
+		 */
+		if (ret == 1 && exe_fault) {
+			dev_err_ratelimited(adev->dev,
+				"[%s] execute permission retry fault "
+				"(src_id:%u ring:%u vmid:%u pasid:%u "
+				"addr:0x%016llx flags:0x%02x)\n",
+				entry->vmid_src ? "mmhub" : "gfxhub",
+				entry->src_id, entry->ring_id,
+				entry->vmid, entry->pasid, addr,
+				(unsigned int)(entry->src_data[1] & 0xff));
+			/* Fall through to print L2 protection fault status */
+		} else if (ret == 1) {
+			/* Returning 1 prevents sending the IV to the KFD */
 			return 1;
+		}
 	}
 
 	if (!amdgpu_sriov_vf(adev)) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
index 19406ab92c5b..800592bc908c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
@@ -360,6 +360,15 @@ static void event_interrupt_wq_v10(struct kfd_node *dev,
 		info.prot_valid = ring_id & 0x08;
 		info.prot_read = ring_id & 0x10;
 		info.prot_write = ring_id & 0x20;
+		info.prot_exec = ih_ring_entry[5] & 0x10;
+
+		if (info.prot_exec)
+			dev_info_ratelimited(dev->adev->dev,
+				"KFD: execute permission fault "
+				"(vmid:%u pasid:%u addr:0x%llx src_data1:0x%x)\n",
+				vmid, pasid,
+				(uint64_t)info.page_addr << PAGE_SHIFT,
+				le32_to_cpu(ih_ring_entry[5]));
 
 		memset(&exception_data, 0, sizeof(exception_data));
 		exception_data.gpu_id = dev->id;
-- 
2.54.0

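
For review purposes, the control flow that the gmc_v10_0.c hunk gives
the retry branch can be modelled in a small standalone sketch. This is
not driver code: the sketch_/SKETCH_ names are hypothetical, the 0x10
EXE mask is taken from the KFD hunk above, and the 0x80 RETRY mask is an
assumption about AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY. The second
argument stands in for the return value of
amdgpu_gmc_handle_retry_fault().

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FAULT_RETRY 0x80u /* assumed value of the RETRY flag */
#define SKETCH_FAULT_EXE   0x10u /* same mask the KFD hunk applies */

enum sketch_action {
	SKETCH_PRINT_FAULT,    /* continue to the L2 protection fault dump */
	SKETCH_LOG_THEN_PRINT, /* log the execute retry fault, then continue */
	SKETCH_RETURN_HANDLED, /* return 1; the IV is not sent to the KFD */
};

/* Mirror the branch structure of the patched retry handling. */
static inline enum sketch_action
sketch_classify(uint32_t src_data1, int retry_handler_ret)
{
	bool retry_fault = !!(src_data1 & SKETCH_FAULT_RETRY);
	bool exe_fault = !!(src_data1 & SKETCH_FAULT_EXE);

	if (!retry_fault)
		return SKETCH_PRINT_FAULT;
	if (retry_handler_ret == 1 && exe_fault)
		return SKETCH_LOG_THEN_PRINT;
	if (retry_handler_ret == 1)
		return SKETCH_RETURN_HANDLED;
	return SKETCH_PRINT_FAULT;
}
```

The only behavioral difference from the pre-patch flow is the
SKETCH_LOG_THEN_PRINT case: a retry fault that the retry handler claims
(return value 1) is no longer swallowed when the EXE bit is set.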