On 5/27/26 22:49, Yunxiang Li wrote:
> gfx_v9_0_hw_fini() unconditionally puts priv_reg_irq, priv_inst_irq,
> bad_op_irq and cp_ecc_error_irq, but the matching gets in
> gfx_v9_0_late_init() and amdgpu_gfx_ras_late_init() may be skipped
> on SR-IOV VF, partial late_init failure, or an earlier IP init
> failure.  When hw_fini then runs, the unmatched puts underflow the
> refcounts and trip the amdgpu_irq_put() WARN:
> 
>   WARNING: CPU: 4 PID: 6367 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:676
>   RIP: amdgpu_irq_put+0xc6/0xe0 [amdgpu]
>   Call Trace:
>    gfx_v9_0_hw_fini+0x200/0x9a0 [amdgpu]
>    amdgpu_ip_block_hw_fini+0x29/0xc0 [amdgpu]
>    amdgpu_device_fini_hw+0x309/0x5f0 [amdgpu]
>    amdgpu_driver_unload_kms+0x7c/0x90 [amdgpu]
>    amdgpu_pci_remove+0x51/0x90 [amdgpu]

That's a good catch.

> Guard each put with amdgpu_irq_enabled() so hw_fini only releases
> IRQs that are currently held.

That's a clear NAK.

This just works around the problem in an incorrect way. The real question is 
why we have the get in late_init()?

Regards,
Christian.

> 
> Fixes: d97b02bb9c7aa ("drm/amdgpu/gfx: disable gfx9 cp_ecc_error_irq only 
> when enabling legacy gfx ras")
> Cc: [email protected]
> Signed-off-by: Yunxiang Li <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index bf270e605949f..e5a3735d98342 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -4057,11 +4057,14 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block 
> *ip_block)
>  {
>       struct amdgpu_device *adev = ip_block->adev;
>  
> -     if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> +     if (amdgpu_irq_enabled(adev, &adev->gfx.cp_ecc_error_irq, 0))
>               amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
> -     amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
> -     amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> -     amdgpu_irq_put(adev, &adev->gfx.bad_op_irq, 0);
> +     if (amdgpu_irq_enabled(adev, &adev->gfx.priv_reg_irq, 0))
> +             amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
> +     if (amdgpu_irq_enabled(adev, &adev->gfx.priv_inst_irq, 0))
> +             amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> +     if (amdgpu_irq_enabled(adev, &adev->gfx.bad_op_irq, 0))
> +             amdgpu_irq_put(adev, &adev->gfx.bad_op_irq, 0);
>  
>       /* DF freeze and kcq disable will fail */
>       if (!amdgpu_ras_intr_triggered())

Reply via email to