From: Lijo Lazar <[email protected]>

[ Upstream commit 8980be03b3f9a4b58197ef95d3b37efa41a25331 ]

VF doesn't enable VCN poison irq in VCNv2.5. Skip releasing it and avoid
call trace during deinitialization.

[   71.913601] [drm] clean up the vf2pf work item
[   71.915088] ------------[ cut here ]------------
[   71.915092] WARNING: CPU: 3 PID: 1079 at 
/tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[   71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) 
amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) 
drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi 
binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev 
serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs 
blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic 
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic 
ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs 
i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp 
mii ttm aesni_intel crypto_simd cryptd
[   71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G           OE      
6.8.0-87-generic #88~22.04.1-Ubuntu
[   71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[   71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[   71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 
fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb 
d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00
[   71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246
[   71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000
[   71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000
[   71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000
[   71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[   71.915792] FS:  000070cff87c4c40(0000) GS:ffff893abfb80000(0000) 
knlGS:0000000000000000
[   71.915795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0
[   71.915800] PKRU: 55555554
[   71.915802] Call Trace:
[   71.915805]  <TASK>
[   71.915809]  vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu]

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Mangesh Gadre <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

Now I have a comprehensive understanding. Let me summarize the analysis.

## Analysis

### 1. Commit Message Analysis

The commit fixes a **WARNING/call trace during deinitialization** on AMD
GPU Virtual Functions (VF, i.e., SR-IOV guests). The commit message
includes the full stack trace showing the `WARN_ON` trigger at
`amdgpu_irq_put+0xc6/0xe0` called from `vcn_v2_5_hw_fini+0x19e/0x1e0`,
occurring during `rmmod` of the amdgpu driver.

### 2. Code Change Analysis

**The bug**: In `vcn_v2_5_hw_fini()`, the code unconditionally calls
`amdgpu_irq_put()` for `ras_poison_irq` whenever RAS is supported.
However, for SR-IOV VF (Virtual Function) environments,
`amdgpu_irq_get()` was **never called** on this IRQ source during
initialization. The IRQ enable path is in `amdgpu_vcn_ras_late_init()`
which runs through the RAS block late init — but for VFs, the RAS
interrupt operations are not fully initialized/enabled (as the comment
in the fix says: "VF doesn't enable interrupt operations for RAS").

When `amdgpu_irq_put()` is called on an IRQ that was never enabled
(refcount is 0), line 639 of `amdgpu_irq.c` triggers:
`WARN_ON(!amdgpu_irq_enabled(adev, src, type))`, causing the stack trace
shown in the commit message.

**The fix**: Adds `!amdgpu_sriov_vf(adev)` check before calling
`amdgpu_irq_put()`, so the IRQ release is skipped on VF — matching the
fact that it was never enabled on VF. This is a minimal 2-line change
(adding the VF check to the existing conditional).

### 3. Classification

This is a **bug fix** — it fixes a mismatched IRQ get/put that causes a
WARNING and call trace during driver deinitialization (rmmod) on SR-IOV
VF environments. The fix is:
- Obviously correct (symmetry between init/fini paths)
- Small and surgical (2-line change to an existing conditional)
- Fixes a real user-visible issue (WARNING + call trace during rmmod)
- No new features, no API changes

### 4. Scope and Risk

- **Very small change**: Only adds a VF check to an existing `if`
  condition
- **Low risk**: The change only affects SR-IOV VF environments, and it
  simply skips an operation that should never have run in that context
- **Single file**: Only `vcn_v2_5.c` is modified
- **Well-understood pattern**: Other VCN versions (e.g., vcn_v4_0.c)
  have similar structures, and the author (Lijo Lazar) is an AMD kernel
  developer familiar with the subsystem

### 5. User Impact

This affects AMD GPU users running in SR-IOV virtualized environments
(VMs using AMD GPU virtual functions). When they unload the amdgpu
driver (rmmod), they see a WARNING call trace in the kernel log. While
not a crash, this is a real operational annoyance and the WARN_ON could
trigger panic-on-warn systems.

### 6. Stability

- Reviewed by Mangesh Gadre (AMD engineer)
- Signed off by Alex Deucher (AMD GPU maintainer)
- The fix is logically straightforward — matching the fini path to what
  the init path actually does

### 7. Dependencies

None. The fix is self-contained and uses existing APIs
(`amdgpu_sriov_vf()`) that have been in the kernel for a long time. The
`vcn_v2_5.c` file and the relevant code paths exist in all supported
stable trees that include VCN 2.5 support.

## Verification

- Read `vcn_v2_5.c:506-529` (pre-fix code on current tree) confirming
  the `amdgpu_irq_put` call at line 524-525 lacks the VF check
- Read `amdgpu_irq.c:623-646` confirming the `WARN_ON` at line 639
  matches the stack trace
- Read `amdgpu_vcn.c:1286-1310` confirming `amdgpu_vcn_ras_late_init`
  calls `amdgpu_irq_get` for poison IRQ — but this may not run for VF
  environments
- Verified commit `8980be03b3f9a` exists in tree and matches the
  analyzed patch
- Searched for similar patterns (`sriov_vf` + `ras_poison_irq`) — no
  other instance found, confirming this is the only VCN version patched
  (VCN 2.5 specific)
- The `vcn_v4_0.c` hw_fini was noted by the explore agent as having the
  same pattern (no VF check before irq_put) — this is potentially a
  separate issue for other VCN versions
- Could NOT verify which stable trees contain VCN 2.5 support
  specifically — unverified, but VCN 2.5 has been in the kernel since
  ~5.4-5.5 era

## Conclusion

This is a small, surgical bug fix that resolves a WARNING/call trace
during driver deinitialization on SR-IOV VF environments. It fixes an
asymmetry between the init and fini paths (IRQ never enabled on VF, but
unconditionally released). The fix is obviously correct, low risk, well-
reviewed, and impacts real users of AMD GPU virtualization. It meets all
stable kernel criteria.

**YES**

 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index cebee453871c1..006a154511971 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -521,7 +521,9 @@ static int vcn_v2_5_hw_fini(struct amdgpu_ip_block 
*ip_block)
                     RREG32_SOC15(VCN, i, mmUVD_STATUS)))
                        vinst->set_pg_state(vinst, AMD_PG_STATE_GATE);
 
-               if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__VCN))
+               /* VF doesn't enable interrupt operations for RAS */
+               if (!amdgpu_sriov_vf(adev) &&
+                   amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__VCN))
                        amdgpu_irq_put(adev, &vinst->ras_poison_irq, 0);
        }
 
-- 
2.51.0

Reply via email to