From: Ionut Nechita <[email protected]>

Hi,

This patch addresses critical TLB flush failures that occur during
hibernation resume on AMD GPUs, particularly affecting ROCm workloads.

Problem:
--------
After resuming from hibernation (S4), the amdgpu driver consistently
fails TLB invalidation operations with these errors:

  amdgpu: TLB flush failed for PASID xxxxx
  amdgpu: failed to write reg 28b4 wait reg 28c6
  amdgpu: failed to write reg 1a6f4 wait reg 1a706

These failures cause compute workloads to malfunction or crash, making
hibernation unreliable for systems running ROCm/OpenCL applications.

Root Cause:
-----------
During resume, the KIQ (Kernel Interface Queue) ring is marked as ready
(ring.sched.ready = true) before the GPU hardware has fully initialized.
When TLB invalidation attempts to use KIQ for register access during
this window, the commands fail because the GPU is not yet stable.

Solution:
---------
This patch introduces a resume_gpu_stable flag that:
- Starts as false during resume
- Forces TLB invalidation to use the reliable MMIO path initially
- Gets set to true after ring tests pass in gfx_v9_0_cp_resume()
- Allows switching to the faster KIQ path once GPU is confirmed stable

This ensures TLB flushes work correctly during early resume while still
benefiting from KIQ-based invalidation after the GPU is fully
operational.

Testing:
--------
Tested on AMD Cezanne (Renoir) with ROCm workloads across multiple
hibernation cycles. The patch eliminates all TLB flush failures and
restores reliable hibernation support for compute workloads.

Impact:
-------
Affects all AMD GPUs using KIQ for TLB invalidation, particularly
visible on systems with active compute workloads (ROCm, OpenCL).

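For reviewers who want the gating logic at a glance, the Solution
section above can be sketched as a small standalone C model. This is
only an illustration under stated assumptions, not the code in the
patch: struct model_gpu, enum flush_path and the model_* helpers are
hypothetical names made up for this sketch, and only the
resume_gpu_stable flag and the ring.sched.ready state it sits next to
come from the description above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state; NOT the real amdgpu_device. */
struct model_gpu {
	bool kiq_ring_ready;    /* models ring.sched.ready of the KIQ ring */
	bool resume_gpu_stable; /* flag described in the Solution section */
};

/* Hypothetical labels for the two TLB invalidation paths. */
enum flush_path { FLUSH_VIA_MMIO, FLUSH_VIA_KIQ };

/*
 * Resume begins: the flag starts as false. The KIQ ring is modelled as
 * already marked ready, which is the early window described under
 * Root Cause.
 */
static void model_begin_resume(struct model_gpu *gpu)
{
	gpu->resume_gpu_stable = false;
	gpu->kiq_ring_ready = true;
}

/* Ring tests passed: the GPU is considered stable from here on. */
static void model_ring_tests_passed(struct model_gpu *gpu)
{
	gpu->resume_gpu_stable = true;
}

/*
 * Pick the invalidation path. KIQ is used only when the ring is ready
 * AND the GPU has been confirmed stable; otherwise fall back to MMIO.
 */
static enum flush_path model_pick_flush_path(const struct model_gpu *gpu)
{
	if (gpu->kiq_ring_ready && gpu->resume_gpu_stable)
		return FLUSH_VIA_KIQ;
	return FLUSH_VIA_MMIO;
}
```

In this model, a flush requested between model_begin_resume() and
model_ring_tests_passed() takes the MMIO path even though the ring is
already marked ready, and any flush after the ring tests takes the KIQ
path.
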
Ionut Nechita (1):
  drm/amdgpu: Fix TLB flush failures after hibernation resume

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
 5 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.52.0
