drm/amdgpu/gfx9: Fix ring and IB test failures after Mode2

Jiqian Chen Tue, 09 Jun 2026 22:58:09 -0700

On Renoir APUs with gfx9, some test scenarios with ring_reset
disabled (for example, accessing an unmapped, invalid address)
trigger a GPU job timeout, and the driver then resets the GPU
with Mode2. After Mode2, however, the CPC and CPF are still stuck,
which makes the compute ring tests fail. In addition, the HQDs of
the MECs are still active, so the MECs use stale HQDs if they are
unhalted before the driver restores the MQDs, which makes the
compute IB tests fail.

So add sequences to reset the CPC and CPF after Mode2, and deactivate
the HQDs of the MECs before unhalting the MECs and mapping the compute
queues.

Signed-off-by: Jiqian Chen <[email protected]>
---
Hi all,

My board is a Renoir APU with gfx9 and smu12. I ran a test case that
accesses an invalid address to trigger amdgpu_job_timedout() with
ring_reset disabled, so that the driver calls Mode2 reset directly.
After the Mode2 reset, the compute ring tests and compute IB tests
fail randomly, on a random compute ring.
We checked the scan dump of the GPU and saw that the CPC and CPF are
still stuck, which may cause the compute ring tests to fail.
I added prints to the driver code (gfx_v9_0_cp_resume) and found that
the HQDs of the MECs are still active, which may cause the MECs to use
stale HQDs when they are unhalted before the compute queues are mapped
(that is, before the MQDs are restored to the HQDs).
So I am sending this patch to fix the problems above.
The patch makes two main changes:
One is to reset the CPC and CPF before resuming the KCQ.
The other is to deactivate the HQDs before unhalting the MECs.
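
As a rough illustration of the two changes above, here is a minimal
user-space sketch in C. It is not driver code: struct sim_gpu, the SIM_*
constants and the sim_* helpers are hypothetical stand-ins, and the bit
positions are made up rather than taken from the gfx9 register headers.
It only models the order of operations: pulse the CPC/CPF soft-reset
bits while preserving the other bits, then clear the active flag of
every HQD on every MEC, pipe and queue.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SIM_SOFT_RESET_CPF	(1u << 0)
#define SIM_SOFT_RESET_CPC	(1u << 1)
#define SIM_SOFT_RESET_CP_BOTH	(SIM_SOFT_RESET_CPC | SIM_SOFT_RESET_CPF)

/* Hypothetical topology; the driver reads these from adev->gfx.mec. */
#define SIM_NUM_MEC		2
#define SIM_NUM_PIPE		4
#define SIM_NUM_QUEUE		8

/* Toy model of the state that is assumed to survive a Mode2 reset. */
struct sim_gpu {
	uint32_t grbm_soft_reset;	/* models GRBM_SOFT_RESET */
	int cp_stuck;			/* 1 while CPC/CPF are wedged */
	uint8_t hqd_active[SIM_NUM_MEC][SIM_NUM_PIPE][SIM_NUM_QUEUE];
};

/* Model of a register write: asserting both bits unsticks CPC/CPF. */
static void sim_write_soft_reset(struct sim_gpu *gpu, uint32_t val)
{
	if ((val & SIM_SOFT_RESET_CP_BOTH) == SIM_SOFT_RESET_CP_BOTH)
		gpu->cp_stuck = 0;
	gpu->grbm_soft_reset = val;
}

/* Assert, then deassert, the CPC/CPF bits; other bits are preserved. */
static void sim_pulse_cp_soft_reset(struct sim_gpu *gpu)
{
	uint32_t tmp = gpu->grbm_soft_reset;

	tmp |= SIM_SOFT_RESET_CP_BOTH;
	sim_write_soft_reset(gpu, tmp);
	(void)gpu->grbm_soft_reset;	/* read back; the driver then waits 50 us */

	tmp &= ~SIM_SOFT_RESET_CP_BOTH;
	sim_write_soft_reset(gpu, tmp);
	(void)gpu->grbm_soft_reset;	/* read back; the driver then waits 50 us */
}

/* Clear every HQD active flag; returns how many were still active. */
static unsigned int sim_deactivate_all_hqds(struct sim_gpu *gpu)
{
	unsigned int cleared = 0;
	int i, j, k;

	for (i = 0; i < SIM_NUM_MEC; i++) {
		for (j = 0; j < SIM_NUM_PIPE; j++) {
			for (k = 0; k < SIM_NUM_QUEUE; k++) {
				/* The driver selects ME i + 1 under srbm_mutex here. */
				if (gpu->hqd_active[i][j][k])
					cleared++;
				gpu->hqd_active[i][j][k] = 0;
			}
		}
	}
	return cleared;
}
```

In this model, calling sim_pulse_cp_soft_reset() and then
sim_deactivate_all_hqds() before anything unhalts the MECs mirrors the
ordering the patch adds to gfx_v9_0_cp_resume().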
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 47721d0c3781..dc0978bc312c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
 
 static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 {
-       int r, i;
+       u32 tmp;
+       int r, i, j, k;
        struct amdgpu_ring *ring;
 
        if (!(adev->flags & AMD_IS_APU))
@@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
                gfx_v9_0_cp_gfx_enable(adev, false);
        gfx_v9_0_cp_compute_enable(adev, false);
 
+       if ((adev->flags & AMD_IS_APU) &&
+               (adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
+               /*
+                * CPC and CPF are still stuck after Mode2 reset, which makes the later
+                * compute ring test fail and then loops Mode2 reset infinitely
+                */
+               tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+               tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
+               tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
+               WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+               tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+               udelay(50);
+
+               tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
+                               GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
+               WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+               tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+               udelay(50);
+
+               /*
+                * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
+                * prevent the MEC from using a stale HQD when it is unhalted before
+                * the MQD is restored. Otherwise, the later compute IB test may fail.
+                */
+               for (i = 0; i < adev->gfx.mec.num_mec; i++) {
+                       for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
+                               for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
+                                       mutex_lock(&adev->srbm_mutex);
+                                       soc15_grbm_select(adev, i + 1, j, k, 0, 0);
+                                       WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
+                                       soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+                                       mutex_unlock(&adev->srbm_mutex);
+                               }
+                       }
+               }
+       }
+
        r = gfx_v9_0_kiq_resume(adev);
        if (r)
                return r;
-- 
2.39.5
