[AMD Official Use Only - AMD Internal Distribution Only]

> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 10 ++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
> 2 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index cab3196a87fb1..0021e763b753a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1124,6 +1124,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev,
> uint32_t reg, uint32_t xcc_
>         if (adev->mes.ring[0].sched.ready)
>                 return amdgpu_mes_rreg(adev, reg, xcc_id);
>
> +       if (amdgpu_in_reset(adev))
> +               return ~0;
> +
>Please note that the existing logic assumes that kiq access will work fine
>even during reset and only could fail under certain reset situations (not all)
>-
>https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c#L1107
>
>Also, there are additional things done after full access is released -
>https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5610
>
>May be it needs a force completion at the right place somewhere in
>amdgpu_device_reset_sriov() as Alex suggested and not to simply block all KIQ
>based reg accesses during reset. In baremetal case, it is done during
>pre-reset as we don't expect any more packet submission/indirect register
>accesses through kernel rings afterwards.
>
>Thanks,
>Lijo

Chenglei: After the in_gpu_reset flag is set, the current logic in the KIQ paths allows a short window in which the HW and KIQ can still process packets before the HW actually starts the reset. Once the HW reset has started, however, the HW stops processing, so new jobs submitted to the KIQ ring fail.

If we want to avoid blocking all KIQ access during reset, we can call amdgpu_fence_driver_force_completion() in amdgpu_device_reset_sriov() after -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5586
where the rings are re-inited.

Also, the current logic in amdgpu_device_pre_asic_reset() skips rings without a GPU scheduler -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5810
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c#L862
so the KIQ rings are skipped and never get a force completion before reset.

The issue should be fixed if we force completion on all possible rings both pre-reset (SRIOV and BM) and post-reset (SRIOV only). I will send out a new patch based on this.

Thanks,
Chenglei

-----Original Message-----
From: Lazar, Lijo <[email protected]>
Sent: Tuesday, March 10, 2026 2:40 AM
To: Xie, Chenglei <[email protected]>; Deucher, Alexander <[email protected]>
Cc: Chan, Hing Pong <[email protected]>; Luo, Zhigang <[email protected]>; [email protected]
Subject: Re: [PATCH v2] drm/amdgpu: Avoid KIQ ring access during GPU reset to fix fence timeout

On 09-Mar-26 10:39 PM, Chenglei Xie wrote:
> After GPU reset, the hardware queue is cleared and all pending fences
> are lost, but the fence writeback memory stays stale. If the driver
> keeps submitting to the KIQ ring during reset (e.g. HDP flush),
> sync_seq advances while writeback does not, so
> amdgpu_fence_emit_polling() waits for lost fences and hits -ETIMEDOUT,
> blocking further KIQ use.
>
> Fix this by skipping KIQ ring use when in reset.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 10 ++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
> 2 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index cab3196a87fb1..0021e763b753a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1124,6 +1124,9 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev,
> uint32_t reg, uint32_t xcc_
>         if (adev->mes.ring[0].sched.ready)
>                 return amdgpu_mes_rreg(adev, reg, xcc_id);
>
> +       if (amdgpu_in_reset(adev))
> +               return ~0;
> +

Please note that the existing logic assumes that kiq access will work fine even during reset and only could fail under certain reset situations (not all) -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c#L1107

Also, there are additional things done after full access is released -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5610

May be it needs a force completion at the right place somewhere in amdgpu_device_reset_sriov() as Alex suggested and not to simply block all KIQ based reg accesses during reset. In baremetal case, it is done during pre-reset as we don't expect any more packet submission/indirect register accesses through kernel rings afterwards.

Thanks,
Lijo

>         BUG_ON(!ring->funcs->emit_rreg);
>
>         spin_lock_irqsave(&kiq->ring_lock, flags);
> @@ -1202,6 +1205,9 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
> uint3
>                 return;
>         }
>
> +       if (amdgpu_in_reset(adev))
> +               return;
> +
>         spin_lock_irqsave(&kiq->ring_lock, flags);
>         r = amdgpu_ring_alloc(ring, 32);
>         if (r)
> @@ -1298,6 +1304,10 @@ int amdgpu_kiq_hdp_flush(struct amdgpu_device *adev)
>         if (adev->enable_mes_kiq && adev->mes.ring[0].sched.ready)
>                 return amdgpu_mes_hdp_flush(adev);
>
> +       /* Avoid KIQ ring access during reset; caller will use amdgpu_hdp_flush fallback */
> +       if (amdgpu_in_reset(adev))
> +               return -EBUSY;
> +
>         if (!ring->funcs->emit_hdp_flush) {
>                 return -EOPNOTSUPP;
>         }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 20e1395b39882..f9db2b17105b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -876,6 +876,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct
> amdgpu_device *adev,
>                 return;
>         }
>
> +       if (amdgpu_in_reset(adev))
> +               return;
> +
>         spin_lock_irqsave(&kiq->ring_lock, flags);
>         amdgpu_ring_alloc(ring, 32);
>         amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
> --
> 2.34.1
>
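
For reference, the sync_seq vs. writeback mismatch discussed in this thread can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_ring and all model_* functions are hypothetical names made up for illustration and are not the amdgpu fence driver structures or API. It shows a fence emitted while the hardware is stopped never signaling until a force completion copies the driver-side sequence number into the writeback value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of one ring's fence state (not amdgpu code). */
struct model_ring {
	uint32_t sync_seq;   /* last sequence number handed out by the driver */
	uint32_t writeback;  /* last sequence number written back by the HW */
	bool hw_running;     /* false once the HW reset has started */
};

/* Emit a fence: sync_seq always advances, writeback only if the HW runs. */
static uint32_t model_emit(struct model_ring *r)
{
	assert(r != NULL);
	uint32_t seq = ++r->sync_seq;

	if (r->hw_running)
		r->writeback = seq; /* the HW processes the packet */
	return seq;              /* otherwise the packet is lost */
}

/* A fence is signaled once writeback has reached its sequence number. */
static bool model_signaled(const struct model_ring *r, uint32_t seq)
{
	return (int32_t)(r->writeback - seq) >= 0;
}

/* Force completion: treat every fence emitted so far as signaled. */
static void model_force_completion(struct model_ring *r)
{
	r->writeback = r->sync_seq;
}

/* Walk through the scenario; returns 0 if each step behaves as described. */
static int model_scenario(void)
{
	struct model_ring r = { .sync_seq = 0, .writeback = 0, .hw_running = true };

	uint32_t before = model_emit(&r);
	if (!model_signaled(&r, before))
		return 1;                /* normal operation must signal */

	r.hw_running = false;        /* HW reset has started */
	uint32_t lost = model_emit(&r);
	if (model_signaled(&r, lost))
		return 2;                /* lost fence must stay pending */

	model_force_completion(&r);  /* e.g. after the rings are re-inited */
	if (!model_signaled(&r, lost))
		return 3;                /* stale writeback must be caught up */
	return 0;
}
```

In this model a waiter polling on the lost fence would spin until its timeout, which corresponds to the -ETIMEDOUT described in the commit message. Calling model_force_completion() once the hardware is back removes the gap without having to block submissions during the reset window.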
