[AMD Official Use Only - AMD Internal Distribution Only]

Hi all,

Just a friendly ping on this patch, which fixes PLAT-192105. I'd appreciate a
review when you have a chance.

Best regards,
Chenglei

-----Original Message-----
From: Xie, Chenglei <[email protected]>
Sent: Tuesday, March 3, 2026 11:19 AM
To: [email protected]
Cc: Chan, Hing Pong <[email protected]>; Luo, Zhigang <[email protected]>; 
Zhang, Hawking <[email protected]>; Xie, Chenglei <[email protected]>
Subject: [PATCH] drm/amdgpu: Fix KIQ fence timeout after GPU reset on GFX v9.4.3

After a GPU reset, the hardware queue is cleared and all pending fences are lost.
However, the fence writeback memory still holds its stale pre-reset value, while
software keeps emitting fences and incrementing sync_seq. As a result,
amdgpu_fence_emit_polling() waits for fences that were lost during the reset and
fails with -ETIMEDOUT.

Fix this by updating the fence writeback memory to match sync_seq after GPU 
reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the hardware's view of 
completed fences with software's view of emitted fences, preventing timeouts 
when waiting for fences that no longer exist.
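The failure mode and the fix can be illustrated with a simplified user-space model. This is a sketch, not amdgpu code: the struct, the function names and NUM_FENCES_MASK = 3 are hypothetical. It also assumes that the emit path polls for the fence NUM_FENCES_MASK slots behind the new one before emitting; that is our reading of amdgpu_fence_emit_polling(), not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window: at most NUM_FENCES_MASK + 1 fences in flight. */
#define NUM_FENCES_MASK 3u

/* Simplified model of a ring's fence state (illustrative names only). */
struct model_fence_drv {
	uint32_t sync_seq;  /* last sequence number emitted by software */
	uint32_t writeback; /* last sequence number the hardware wrote back */
};

/* Wrap-safe check: has the hardware signalled sequence number seq? */
static bool fence_signaled(const struct model_fence_drv *drv, uint32_t seq)
{
	return (int32_t)(drv->writeback - seq) >= 0;
}

/*
 * Emit a new fence, first requiring the fence NUM_FENCES_MASK slots back
 * to have signalled. Nothing advances writeback in this model, so a false
 * return stands in for -ETIMEDOUT.
 */
static bool emit_polling(struct model_fence_drv *drv)
{
	uint32_t seq = ++drv->sync_seq;

	return fence_signaled(drv, seq - NUM_FENCES_MASK);
}

/*
 * State after a reset: software has emitted `emitted` fences, but the
 * hardware only wrote back `completed` before the rest were lost.
 * Optionally apply the fix, then try to emit one more fence.
 */
static bool emit_after_reset(uint32_t emitted, uint32_t completed,
			     bool apply_fix)
{
	struct model_fence_drv drv;

	assert(completed <= emitted);
	drv.sync_seq = emitted;
	drv.writeback = completed; /* stale value left over from before reset */

	if (apply_fix)
		drv.writeback = drv.sync_seq; /* mirrors the patch's update */

	return emit_polling(&drv);
}
```

With 10 fences emitted and only 6 written back, the next emit polls for sequence 8 and never sees it without the fix; with the fix, writeback already reads 10 and the emit proceeds.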

Signed-off-by: Chenglei Xie <[email protected]>
Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index ad4d442e7345e..6b5fcdd987693 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
                gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
                soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
                mutex_unlock(&adev->srbm_mutex);
+
+               /* Update fence writeback memory to align with software state after reset.
+                * After GPU reset, the hardware queue is cleared and all pending fences
+                * are lost, but the fence writeback memory may still hold a stale
+                * pre-reset value. Set it to sync_seq so that fence polling does not
+                * wait for the lost fences and time out.
+                */
+               if (ring->fence_drv.cpu_addr)
+                       *ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
        } else {
                memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
                ((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
--
2.34.1
