[AMD Official Use Only - AMD Internal Distribution Only]

Hi Alex,

I have updated the patch to skip KIQ ring access during reset, which fixes the
fence timeout. Please review, thanks!

Regards,
Chenglei

-----Original Message-----
From: Xie, Chenglei <[email protected]>
Sent: Monday, March 9, 2026 1:10 PM
To: Deucher, Alexander <[email protected]>
Cc: Chan, Hing Pong <[email protected]>; Luo, Zhigang <[email protected]>; 
[email protected]; Xie, Chenglei <[email protected]>
Subject: [PATCH v2] drm/amdgpu: Avoid KIQ ring access during GPU reset to fix fence timeout

After GPU reset, the hardware queue is cleared and all pending fences are lost, 
but the fence writeback memory stays stale. If the driver keeps submitting to 
the KIQ ring during reset (e.g. HDP flush), sync_seq advances while writeback 
does not, so amdgpu_fence_emit_polling() waits for lost fences and hits 
-ETIMEDOUT, blocking further KIQ use.

Fix this by skipping KIQ ring use when in reset.
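
The mismatch described above can be illustrated with a small standalone model.
This is only a sketch under simplifying assumptions: struct toy_ring and the
toy_* helpers are hypothetical names, not amdgpu code, and the real fence
polling logic is more involved. It shows a submission made during reset
advancing sync_seq while writeback stays put, and the in-reset guard avoiding
that.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the amdgpu structures. */
struct toy_ring {
	uint32_t sync_seq;   /* last sequence number the driver emitted */
	uint32_t writeback;  /* last sequence number the hardware signalled */
	bool in_reset;       /* stands in for amdgpu_in_reset() */
};

/* Emit a fence; with the guard enabled, refuse while a reset is in progress. */
static int toy_emit(struct toy_ring *r, bool guard, uint32_t *seq)
{
	if (guard && r->in_reset)
		return -EBUSY;
	*seq = ++r->sync_seq;
	return 0;
}

/* Hardware signals everything queued, unless the reset dropped the queue. */
static void toy_hw_process(struct toy_ring *r)
{
	if (!r->in_reset)
		r->writeback = r->sync_seq;
}

/* Stand-in for a bounded poll: succeed only if writeback reached seq. */
static int toy_poll(const struct toy_ring *r, uint32_t seq)
{
	return r->writeback >= seq ? 0 : -ETIMEDOUT;
}

/*
 * Submit one fence normally, then one during reset. Returns the result of
 * the in-reset submission and stores sync_seq - writeback in *gap.
 */
static int toy_run_scenario(bool guard, uint32_t *gap)
{
	struct toy_ring r = { 0 };
	uint32_t seq = 0, lost = 0;
	int ret;

	/* Normal operation: the fence is emitted and then signalled. */
	ret = toy_emit(&r, guard, &seq);
	assert(ret == 0);
	toy_hw_process(&r);
	assert(toy_poll(&r, seq) == 0);

	/* During reset the hardware never signals what was submitted. */
	r.in_reset = true;
	ret = toy_emit(&r, guard, &lost);
	toy_hw_process(&r);
	*gap = r.sync_seq - r.writeback;

	if (ret)
		return ret;               /* guarded: nothing was submitted */
	return toy_poll(&r, lost);        /* unguarded: wait on a lost fence */
}
```

In this model, toy_run_scenario(false, &gap) ends with a timeout and a gap of
one lost fence, while toy_run_scenario(true, &gap) returns -EBUSY and leaves
sync_seq and writeback in step.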

Signed-off-by: Chenglei Xie <[email protected]>
Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index cab3196a87fb1..0021e763b753a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -1124,6 +1124,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg, uint32_t xcc_
        if (adev->mes.ring[0].sched.ready)
                return amdgpu_mes_rreg(adev, reg, xcc_id);

+       if (amdgpu_in_reset(adev))
+               return ~0;
+
        BUG_ON(!ring->funcs->emit_rreg);

        spin_lock_irqsave(&kiq->ring_lock, flags);
@@ -1202,6 +1205,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint3
                return;
        }

+       if (amdgpu_in_reset(adev))
+               return;
+
        spin_lock_irqsave(&kiq->ring_lock, flags);
        r = amdgpu_ring_alloc(ring, 32);
        if (r)
@@ -1298,6 +1304,10 @@ int amdgpu_kiq_hdp_flush(struct amdgpu_device *adev)
        if (adev->enable_mes_kiq && adev->mes.ring[0].sched.ready)
                return amdgpu_mes_hdp_flush(adev);

+       /* Avoid KIQ ring access during reset; caller will use amdgpu_hdp_flush fallback */
+       if (amdgpu_in_reset(adev))
+               return -EBUSY;
+
        if (!ring->funcs->emit_hdp_flush) {
                return -EOPNOTSUPP;
        }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 20e1395b39882..f9db2b17105b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -876,6 +876,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
                return;
        }

+       if (amdgpu_in_reset(adev))
+               return;
+
        spin_lock_irqsave(&kiq->ring_lock, flags);
        amdgpu_ring_alloc(ring, 32);
        amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
--
2.34.1
