After GPU reset, the hardware queue is cleared and all pending fences are
lost. However, the fence writeback memory remains stale from before reset,
while software continues emitting fences and sync_seq keeps incrementing.
This causes amdgpu_fence_emit_polling() to wait for fences that were lost
during reset, resulting in -ETIMEDOUT errors.

Fix this by updating the fence writeback memory to match sync_seq after
GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the hardware's
view of completed fences with software's view of emitted fences,
preventing timeouts when waiting for fences that no longer exist.

Signed-off-by: Chenglei Xie <[email protected]>
Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index ad4d442e7345e..6b5fcdd987693 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
 		gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
 		soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
 		mutex_unlock(&adev->srbm_mutex);
+
+		/* Update fence writeback memory to align with software state after reset.
+		 * After GPU reset, the hardware queue is cleared and all pending fences
+		 * are lost. The fence writeback memory may be stale from before reset. To prevent
+		 * waiting for lost fences, update writeback memory to match sync_seq.
+		 * This avoids waiting for lost fences and prevents timeouts.
+		 */
+		if (ring->fence_drv.cpu_addr)
+			*ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
 	} else {
 		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
 		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
-- 
2.34.1
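
Illustration (not part of the patch): the failure mode described in the
commit message can be reproduced with a small user-space model. This is only
a sketch under stated assumptions. The struct and every model_* name below
are hypothetical and are not amdgpu code; the model assumes that a polling
emit first waits for the fence num_fences_mask entries back to signal, that
the hardware makes no progress on fences lost in the reset, and it returns -1
as a stand-in for -ETIMEDOUT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a ring's fence driver state. */
struct model_fence_drv {
	uint32_t sync_seq;        /* last sequence number emitted by software */
	uint32_t writeback;       /* last sequence number written back by hardware */
	uint32_t num_fences_mask; /* number of in-flight fence slots minus one */
};

/* True once the writeback value has reached wait_seq (wrap-safe compare). */
static bool model_fence_signaled(const struct model_fence_drv *drv,
				 uint32_t wait_seq)
{
	return (int32_t)(drv->writeback - wait_seq) >= 0;
}

/*
 * Model of a polling emit: bump sync_seq, then require that an older fence
 * has signaled so that a slot is free. Assumption: the hardware never
 * signals the fences lost in the reset, so a failed check is a timeout.
 * Returns 0 on success, -1 as a stand-in for -ETIMEDOUT.
 */
static int model_emit_polling(struct model_fence_drv *drv)
{
	uint32_t seq = ++drv->sync_seq;

	if (!model_fence_signaled(drv, seq - drv->num_fences_mask))
		return -1;
	return 0;
}

/* Model of the fix: make the writeback value match what software emitted. */
static void model_resync_after_reset(struct model_fence_drv *drv)
{
	drv->writeback = drv->sync_seq;
}
```

With sync_seq = 100, writeback = 93 and num_fences_mask = 7 (seven fences
lost in the reset), the next model_emit_polling() call waits for sequence 94
and fails. Calling model_resync_after_reset() first sets writeback to 100, so
the same call succeeds.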
