
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 10 ++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c |  3 +++
>   2 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index cab3196a87fb1..0021e763b753a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1124,6 +1124,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg, uint32_t xcc_
>          if (adev->mes.ring[0].sched.ready)
>                  return amdgpu_mes_rreg(adev, reg, xcc_id);
>
> +       if (amdgpu_in_reset(adev))
> +               return ~0;
> +

>Please note that the existing logic assumes that kiq access will work fine 
>even during reset and only could fail under certain reset situations (not all) 
>-
>
>https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c#L1107
>
>
>Also, there are additional things done after full access is released -
>
>https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5610
>
>May be it needs a force completion at the right place somewhere in
>amdgpu_device_reset_sriov() as Alex suggested and not to simply block all KIQ 
>based reg accesses during reset. In baremetal case, it is done during 
>pre-reset as we don't expect any more packet submission/indirect register 
>accesses through kernel rings afterwards.
>
>Thanks,
>Lijo

Chenglei: After the in_gpu_reset flag is set in the KIQ paths, the current 
logic allows a short window in which the HW and KIQ can still process 
packets before the HW actually starts the reset. But once the HW reset has 
started, the HW stops processing and any new jobs on the KIQ ring fail as a 
result.

If we want to avoid blocking all KIQ access during reset, we can call 
amdgpu_fence_driver_force_completion() in amdgpu_device_reset_sriov() 
after -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5586
where the rings are reinitialized.

Also, the current logic in amdgpu_device_pre_asic_reset() skips rings without 
a GPU scheduler -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5810
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c#L862
so KIQ rings are skipped and never get a force_completion before reset.

The issue should be fixed if we force completion on all possible rings both 
pre-reset (SRIOV and BM) and post-reset (SRIOV only). I will send out a new 
patch based on this.

Thanks,
Chenglei

-----Original Message-----
From: Lazar, Lijo <[email protected]>
Sent: Tuesday, March 10, 2026 2:40 AM
To: Xie, Chenglei <[email protected]>; Deucher, Alexander 
<[email protected]>
Cc: Chan, Hing Pong <[email protected]>; Luo, Zhigang <[email protected]>; 
[email protected]
Subject: Re: [PATCH v2] drm/amdgpu: Avoid KIQ ring access during GPU reset to 
fix fence timeout



On 09-Mar-26 10:39 PM, Chenglei Xie wrote:
>
> After GPU reset, the hardware queue is cleared and all pending fences
> are lost, but the fence writeback memory stays stale. If the driver
> keeps submitting to the KIQ ring during reset (e.g. HDP flush),
> sync_seq advances while writeback does not, so
> amdgpu_fence_emit_polling() waits for lost fences and hits -ETIMEDOUT, 
> blocking further KIQ use.
>
> Fix this by skipping KIQ ring use when in reset.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 10 ++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c |  3 +++
>   2 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index cab3196a87fb1..0021e763b753a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1124,6 +1124,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, uint32_t reg, uint32_t xcc_
>          if (adev->mes.ring[0].sched.ready)
>                  return amdgpu_mes_rreg(adev, reg, xcc_id);
>
> +       if (amdgpu_in_reset(adev))
> +               return ~0;
> +

Please note that the existing logic assumes that kiq access will work fine even 
during reset and only could fail under certain reset situations (not all) -

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c#L1107

Also, there are additional things done after full access is released -

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5610

May be it needs a force completion at the right place somewhere in
amdgpu_device_reset_sriov() as Alex suggested and not to simply block all KIQ 
based reg accesses during reset. In baremetal case, it is done during pre-reset 
as we don't expect any more packet submission/indirect register accesses 
through kernel rings afterwards.

Thanks,
Lijo

>          BUG_ON(!ring->funcs->emit_rreg);
>
>          spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -1202,6 +1205,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint3
>                  return;
>          }
>
> +       if (amdgpu_in_reset(adev))
> +               return;
> +
>          spin_lock_irqsave(&kiq->ring_lock, flags);
>          r = amdgpu_ring_alloc(ring, 32);
>          if (r)
> @@ -1298,6 +1304,10 @@ int amdgpu_kiq_hdp_flush(struct amdgpu_device *adev)
>          if (adev->enable_mes_kiq && adev->mes.ring[0].sched.ready)
>                  return amdgpu_mes_hdp_flush(adev);
>
> +       /* Avoid KIQ ring access during reset; caller will use amdgpu_hdp_flush fallback */
> +       if (amdgpu_in_reset(adev))
> +               return -EBUSY;
> +
>          if (!ring->funcs->emit_hdp_flush) {
>                  return -EOPNOTSUPP;
>          }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 20e1395b39882..f9db2b17105b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -876,6 +876,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
>                  return;
>          }
>
> +       if (amdgpu_in_reset(adev))
> +               return;
> +
>          spin_lock_irqsave(&kiq->ring_lock, flags);
>          amdgpu_ring_alloc(ring, 32);
>          amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
> --
> 2.34.1
>
