amdgpu: Fix TLB flush failures after hibernation resume

Christian König Thu, 08 Jan 2026 04:36:35 -0800

On 1/6/26 13:59, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <[email protected]>
> 
> After resume from hibernation, the amdgpu driver experiences TLB
> flush failures with errors:


In general we don't support hibernation with the driver.

> 
>   amdgpu: TLB flush failed for PASID xxxxx
>   amdgpu: failed to write reg 28b4 wait reg 28c6
>   amdgpu: failed to write reg 1a6f4 wait reg 1a706
> 
> Root Cause:
> -----------
> The KIQ (Kernel Interface Queue) ring is marked as ready
> (ring.sched.ready = true) during resume, but the hardware is not
> fully functional yet.

Well since that is the root cause we should probably fix that.

> When TLB invalidation attempts to use KIQ
> for register access, the commands fail because the GPU hasn't
> completed initialization.
> 
> Solution:
> ---------
> 1. Add resume_gpu_stable flag (initially false on resume)
> 2. Force TLB invalidation to use direct MMIO path instead of KIQ
>    when resume_gpu_stable is false

That is a really bad idea. So absolutely clear NAK to this patch here.

> 3. After ring tests pass in gfx_v9_0_cp_resume(), set
>    resume_gpu_stable to true
> 4. From that point forward, use faster KIQ path for TLB invalidation
> 
> This ensures TLB flushes work correctly during early resume while
> still benefiting from KIQ-based invalidation after the GPU is stable.

No it doesn't. This patch only works by coincident and not proper engineering.

This only works because the TLB flush is most likely not necesssary.

Question is why the KIQ is not up and running before we do anything with it?

Regards,
Christian.

> 
> Changes:
> --------
> - amdgpu.h: Add resume_gpu_stable flag to amdgpu_device
> - amdgpu_device.c: Initialize resume_gpu_stable to false on resume
> - amdgpu_gmc.c: Check resume_gpu_stable in flush_gpu_tlb_pasid
> - gfx_v9_0.c: Set resume_gpu_stable after ring tests pass
> - gmc_v9_0.c: Check resume_gpu_stable before using KIQ path
> 
> Tested on AMD Cezanne (Renoir) with ROCm workloads after hibernation.
> Result: Eliminates TLB flush failures on resume.
> 
> Signed-off-by: Ionut Nechita <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
>  5 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 9f9774f58ce1c..6bf4f6c90164c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1225,6 +1225,7 @@ struct amdgpu_device {
>       bool                            in_s4;
>       bool                            in_s0ix;
>       suspend_state_t                 last_suspend_state;
> +     bool                            resume_gpu_stable;
>  
>       enum pp_mp1_state               mp1_state;
>       struct amdgpu_doorbell_index doorbell_index;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 12201b8e99b3f..440d86ed1e0d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5457,6 +5457,12 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> notify_clients)
>               goto exit;
>       }
>  
> +     /*
> +      * Set resume_gpu_stable to false BEFORE KFD resume to ensure
> +      * extended timeouts are used for TLB flushes during hibernation 
> recovery
> +      */
> +     adev->resume_gpu_stable = false;
> +
>       r = amdgpu_amdkfd_resume(adev, !amdgpu_sriov_vf(adev) && 
> !adev->in_runpm);
>       if (r)
>               goto exit;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 869bceb0fe2c6..83fe30f0ada75 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -731,7 +731,12 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device 
> *adev, uint16_t pasid,
>       if (!down_read_trylock(&adev->reset_domain->sem))
>               return 0;
>  
> -     if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
> +     /*
> +      * After hibernation resume, KIQ may report as ready but not be fully
> +      * functional. Use direct MMIO path until GPU is confirmed stable.
> +      */
> +     if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> +         !adev->resume_gpu_stable) {
>               if (adev->gmc.flush_tlb_needs_extra_type_2)
>                       adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>                                                                2, all_hub,
> @@ -835,9 +840,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct 
> amdgpu_device *adev,
>               goto failed_kiq;
>  
>       might_sleep();
> +
>       while (r < 1 && cnt++ < MAX_KIQ_REG_TRY &&
>              !amdgpu_reset_pending(adev->reset_domain)) {
> -
>               msleep(MAX_KIQ_REG_BAILOUT_INTERVAL);
>               r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>       }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 0148d7ff34d99..fbd07b455b915 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3985,6 +3985,16 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device 
> *adev)
>               amdgpu_ring_test_helper(ring);
>       }
>  
> +     /*
> +      * After successful ring tests, mark GPU as stable for resume.
> +      * This allows KIQ-based TLB invalidation to be used instead of
> +      * slower direct MMIO path.
> +      */
> +     if (!adev->resume_gpu_stable) {
> +             adev->resume_gpu_stable = true;
> +             dev_info(adev->dev, "GPU rings verified, enabling KIQ path\n");
> +     }
> +
>       gfx_v9_0_enable_gui_idle_interrupt(adev, true);
>  
>       return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 8ad7519f7b581..8a0202f6b3e3c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -855,9 +855,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>  
>       /* This is necessary for SRIOV as well as for GFXOFF to function
>        * properly under bare metal
> +      *
> +      * After hibernation resume, KIQ may report as ready but not be fully
> +      * functional. Use direct MMIO path until GPU is confirmed stable.
>        */
>       if (adev->gfx.kiq[inst].ring.sched.ready &&
> -         (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> +         (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
> +         adev->resume_gpu_stable) {
>               uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
>               uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
>

Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume

Reply via email to