Reviewed-by: Alex Deucher <[email protected]>
________________________________
From: amd-gfx <[email protected]> on behalf of Yunxiang Li <[email protected]>
Sent: Friday, June 5, 2026 11:02 AM
To: Deucher, Alexander <[email protected]>; Koenig, Christian <[email protected]>
Cc: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; Zhang, Hawking <[email protected]>; [email protected] <[email protected]>; Li, Yunxiang (Teddy) <[email protected]>
Subject: [PATCH] drm/amdgpu: skip already suspended IP blocks in ip_suspend_phase2

The GPU reload test (S3 / mode1 reset / module reload) triggers a
WARN_ON in amdgpu_irq_put() on gfx10 when unloading amdgpu:

  WARNING: CPU: 0 PID: 2314 at amd/amdgpu/amdgpu_irq.c:676 amdgpu_irq_put+0xc3/0xe0 [amdgpu]
  Call Trace:
   gfx_v10_0_hw_fini+0x41/0x150 [amdgpu]
   amdgpu_ip_block_hw_fini+0x29/0xc0 [amdgpu]
   amdgpu_device_fini_hw+0x315/0x610 [amdgpu]
   amdgpu_driver_unload_kms+0x7c/0x90 [amdgpu]
   amdgpu_pci_remove+0x51/0x90 [amdgpu]

amdgpu_device_ip_resume_phase2() skips IP blocks whose status.hw is
already set, but amdgpu_device_ip_suspend_phase2() never had the
matching guard, so a block can be suspended twice (e.g. when a reset or
recovery is issued while the device is already suspended).  The second
suspend runs hw_fini again, which now releases the gfx fault IRQs
unconditionally, dropping a refcount that is already zero and tripping
the WARN_ON in amdgpu_irq_put().

The fault/EOP IRQ get/put were balanced through late_init/hw_fini
before, which masked the double-suspend; moving the get into hw_init
made the suspend/resume asymmetry visible as an IRQ refcount underflow.

Honor status.hw in ip_suspend_phase2() so suspend mirrors resume and a
block is only torn down once.
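
The double-suspend described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: the
toy_* names and struct toy_ip_block are hypothetical stand-ins, not the
real amdgpu structures or functions, and the IRQ refcount is reduced to
a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for one IP block; not struct amdgpu_ip_block. */
struct toy_ip_block {
	bool valid;        /* block is present on this device */
	bool hw;           /* hardware is currently initialized */
	int irq_refcount;  /* references taken in hw_init */
	int warnings;      /* refcount underflows observed */
};

/* Models hw_init taking the IRQ reference and marking the block live. */
static void toy_hw_init(struct toy_ip_block *blk)
{
	assert(blk);
	blk->irq_refcount++;
	blk->hw = true;
}

/* Models hw_fini dropping the IRQ reference unconditionally. */
static void toy_hw_fini(struct toy_ip_block *blk)
{
	assert(blk);
	if (blk->irq_refcount == 0) {
		/* Analogue of the WARN_ON: put on a zero refcount. */
		blk->warnings++;
		fprintf(stderr, "WARN: irq put with refcount 0\n");
	} else {
		blk->irq_refcount--;
	}
	blk->hw = false;
}

/* honor_hw selects between the old check and the guarded check. */
static void toy_suspend_phase2(struct toy_ip_block *blk, bool honor_hw)
{
	if (!blk->valid || (honor_hw && !blk->hw))
		return;
	toy_hw_fini(blk);
}

/* Initialize once, suspend twice, and report how many warnings fired. */
static int toy_double_suspend_warnings(bool honor_hw)
{
	struct toy_ip_block blk = { .valid = true };

	toy_hw_init(&blk);
	toy_suspend_phase2(&blk, honor_hw);
	toy_suspend_phase2(&blk, honor_hw);
	return blk.warnings;
}
```

In this model, toy_double_suspend_warnings(false) tears the block down
twice and records one underflow, while toy_double_suspend_warnings(true)
skips the second teardown because hw is already clear.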

Fixes: 3402365f4ca8 ("drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini")
Fixes: 482f0e538580 ("drm/amdgpu: fix double ucode load by PSP(v3)")
Signed-off-by: Yunxiang Li <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6608780ffef2f..dc8c650fc3416 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3044,7 +3044,7 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev)
                 amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D3Entry);

         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
-               if (!adev->ip_blocks[i].status.valid)
+               if (!adev->ip_blocks[i].status.valid || !adev->ip_blocks[i].status.hw)
                         continue;
                 /* displays are handled in phase1 */
                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE)
--
2.51.2