https://bugzilla.kernel.org/show_bug.cgi?id=220813

            Bug ID: 220813
           Summary: 3x Radeon RX 7900 XTX Cards Exhibiting Identical PCIe
                    Bus Dropouts, SMU/GFXOFF Failures, and Full GPU Loss
                    Under Gaming and Compute Loads
           Product: Drivers
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: high
          Priority: P3
         Component: Video(DRI - non Intel)
          Assignee: [email protected]
          Reporter: [email protected]
        Regression: No

Created attachment 308979
  --> https://bugzilla.kernel.org/attachment.cgi?id=308979&action=edit
Logs captured on Ubuntu in both Gaming and AI workloads demonstrating the
failure pattern

I am reporting a reproducible and severe failure mode affecting three separate
AMD Radeon RX 7900 XTX GPUs across two years, two operating systems, multiple
workloads, and multiple driver stacks. All three cards eventually degrade into
the same failure condition: full GPU disappearance from the PCIe bus, SMU
lock-ups, GFX ring timeouts, SDMA failures, fans ramping to 100%, and permanent
instability until AC power is removed.

This appears to be a systemic hardware or firmware-level issue, not isolated to
one card or one environment.

Hardware and System Information

GPU 1: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: A6V070402

GPU 2: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: ACB000640

GPU 3: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: Y6V013609 (the first card I owned; it eventually died with no
display signal and has been RMA'd)

Motherboard: ASUS Prime Pro X570
BIOS: 5031
PSU: Cooler Master MWE Gold 1050 V2 (3 independent PCIe power runs)
OS: Ubuntu 24.04.3 (Noble Numbat)
Kernel: 6.14.0-36-generic
Drivers: Upstream amdgpu (KMS enabled)

Control GPU Testing (Stable GPUs in Same System)

To exclude PSU, motherboard, RAM, OS, kernel, driver, BIOS, and workload
factors, I tested two separate control GPUs in the same hardware:

Control GPU A: PowerColor Radeon RX 7800 XT
Control GPU B: Sapphire Nitro+ Radeon RX 7800 XT

Both 7800 XT cards were tested under:

The same llama.cpp torture workloads

The same gaming workloads (Superposition, Battlefield 2042, Helldivers 2,
Battlefield 6)

The same system, same PSU cables, same PCIe slot, same BIOS config

The same Ubuntu installation and same Windows installation

Result:
Both 7800 XT cards are completely stable.
No crashes, no artefacts, no PCIe dropouts, no SMU/GFXOFF failures.

This confirms the issue is specific to the 7900 XTX GPU family in this
configuration, not the system or software.

The 7900 XTX failure also occurs under Windows 11, fully updated with the
latest Adrenalin drivers.

Crash Behaviour Description

All affected GPUs exhibit the same failure mode:

Display output instantly cuts to black

GPU fans ramp to 100%

The operating system continues running in background

amdgpu reports:

GFX ring timeouts

SDMA ring timeouts

SMU failure to respond

GFXOFF cannot be disabled

The GPU vanishes from the PCIe bus

Driver reset attempts fail with the error:
"GPU reset begin → device lost from bus → GPU Recovery Failed: -19"

A full AC power removal is required to restore partial functionality

After the crash, the GPU often displays visual artefacts and instability even
after reboot

This post-crash corruption persists until PSU is switched off and the
motherboard fully discharges

In some cases, the system will not power on unless the PSU switch is in the O
(off) position.

These symptoms are identical on all three cards.
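
To confirm from the running OS that a card has actually dropped off the bus,
the PCI sysfs tree can be inspected directly. The following is a minimal
sketch, not part of the original test procedure. It assumes the standard Linux
sysfs layout under /sys/bus/pci/devices, the AMD/ATI PCI vendor ID 0x1002, and
PCI base class 0x03 (display controller). The function name
amd_display_devices is hypothetical. On a system with an AMD integrated GPU,
that device would be listed as well.

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"        # PCI vendor ID assigned to AMD/ATI
DISPLAY_CLASS_PREFIX = "0x03"   # PCI base class 0x03 = display controller


def amd_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return sorted PCI addresses of AMD display controllers visible in sysfs."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            # Device vanished mid-scan or its attributes are unreadable.
            continue
        if vendor == AMD_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            found.append(dev.name)
    return sorted(found)


if __name__ == "__main__":
    devices = amd_display_devices()
    print("AMD display devices on the bus:", devices or "none")
```

Running this before and after a crash shows whether the GPU's PCI address is
still enumerated; an empty list after the black screen is consistent with the
"device lost from bus" message.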

Linux Crash Reproduction (Compute Workload)

The failure is highly reproducible with llama.cpp under full GPU load.
Configuration:

ngl = 999

Context = 4096

Threads = 8

Model: Qwen2.5-7B q4_k_m

Continuous prompt generation in an infinite loop

Approximate crash times:

First crash: 4 hours

Second crash: 1–2 hours

Crashes are guaranteed within a few hours
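
The llama.cpp configuration above can be sketched as a small loop driver. This
is an illustration only: the report does not give the exact command line, so
the binary name (llama-cli), the model path, the prompt, and the token count
are all assumptions. Only ngl = 999, context = 4096, and threads = 8 come from
the configuration listed above. Depending on the llama.cpp version, extra
flags may be needed to keep the run non-interactive.

```python
import subprocess
import time

# Hypothetical placeholder; the report names the model but not its location.
MODEL_PATH = "/path/to/qwen2.5-7b-q4_k_m.gguf"


def build_command(model_path, binary="llama-cli",
                  prompt="Write a long story.", n_predict=512):
    """Build an argv list using the settings listed in the report."""
    return [
        binary,
        "-m", model_path,      # model file (path is an assumption)
        "-ngl", "999",         # offload all layers to the GPU
        "-c", "4096",          # context size
        "-t", "8",             # CPU threads
        "-p", prompt,          # prompt text (assumption)
        "-n", str(n_predict),  # tokens per run (assumption)
    ]


def run_until_failure(cmd, max_runs=None):
    """Re-run cmd until it exits non-zero; return (runs, elapsed_seconds)."""
    runs = 0
    start = time.monotonic()
    while max_runs is None or runs < max_runs:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        runs += 1
        if result.returncode != 0:
            break
    return runs, time.monotonic() - start


if __name__ == "__main__":
    # Print the command only; call run_until_failure(...) to start the loop.
    print(" ".join(build_command(MODEL_PATH)))
```

A hard GPU hang may leave the process blocked rather than returning a
non-zero exit code, so the elapsed time from this loop is only a rough guide
to time-to-failure.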

Glmark2 completes without crashing, suggesting the issue is related to
sustained high-load VRAM/SDMA/GFX conditions, not basic rendering.

Windows Crash Reproduction (Gaming Workloads)

The failure also occurs in normal gaming scenarios on both Linux and Windows.

Games and benchmarks that trigger identical failure:

Unigine Superposition (4K Optimized or higher)

Battlefield 2042

Battlefield 6

Dying Light 2

Helldivers 2

The Windows symptoms are identical:

Black screen

Fans to 100%

GPU disappears until full PSU power cycle

Event Viewer shows GPU timeouts and failed reset attempts

Post-crash artefacting on secondary display

This rules out OS-level and application-specific causes.

Key Kernel Log Excerpts

SMU/GFXOFF failures repeatedly logged:
amdgpu: SMU: response:0xFFFFFFFF for index:41
amdgpu: Failed to disable gfxoff

Ring timeouts:
amdgpu: ring gfx timeout
amdgpu: ring sdma1 timeout

Reset failures:
amdgpu: MES failed to respond to msg=RESET
amdgpu: GPU reset begin!
amdgpu: device lost from bus!
amdgpu: GPU reset end with ret = -19
amdgpu: GPU Recovery Failed: -19

These cascades happen repeatedly until the GPU fully disappears.
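
The excerpts above can be counted mechanically across a full kernel log. The
following is a minimal sketch: the regular expressions are derived only from
the message fragments quoted in this report, the category names are
hypothetical labels, and real amdgpu messages may carry extra prefixes or
fields that these patterns do not anticipate. The return code -19 corresponds
to ENODEV ("No such device") on Linux.

```python
import re
from collections import Counter

# Patterns taken from the message fragments quoted in this report.
SIGNATURES = {
    "smu_no_response": re.compile(r"SMU: response:0xFFFFFFFF"),
    "gfxoff_failure": re.compile(r"Failed to disable gfxoff"),
    "ring_timeout": re.compile(r"ring (\S+) timeout"),
    "mes_no_response": re.compile(r"MES failed to respond"),
    "reset_begin": re.compile(r"GPU reset begin"),
    "device_lost": re.compile(r"device lost from bus"),
    "recovery_failed": re.compile(r"GPU Recovery Failed: (-?\d+)"),
}

# The excerpt quoted in this report, used here as sample input.
SAMPLE = """\
amdgpu: SMU: response:0xFFFFFFFF for index:41
amdgpu: Failed to disable gfxoff
amdgpu: ring gfx timeout
amdgpu: ring sdma1 timeout
amdgpu: MES failed to respond to msg=RESET
amdgpu: GPU reset begin!
amdgpu: device lost from bus!
amdgpu: GPU reset end with ret = -19
amdgpu: GPU Recovery Failed: -19
"""


def summarize(log_text):
    """Count how many amdgpu lines match each failure signature."""
    counts = Counter()
    for line in log_text.splitlines():
        if "amdgpu" not in line:
            continue
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(f"{name}: {count}")
```

To apply this to a real crash, pass the text of the previous boot's kernel log
(for example, saved output of journalctl -k -b -1) to summarize() instead of
SAMPLE.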

Post-Crash Artefacts and PCIe Behaviour

Persistent visual corruption after reboot

Corruption persists until full AC removal

Removing secondary displays does not eliminate artefacts

PCIe bus occasionally behaves unpredictably

System attempts power-on with PSU switch OFF (reported on two GPUs)

Conclusion

Across three separate 7900 XTX GPUs, from different batches and time periods,
the same catastrophic failure mode occurs. This failure manifests identically
across Linux and Windows, gaming and compute workloads, and even after multiple
fresh OS installations.

The issue involves:

GFX/SDMA ring instability

SMU unresponsiveness

GFXOFF control failure

Complete PCIe device disappearance

Failure of the amdgpu driver to reset the card

Post-crash persistent artefacts

Power state anomalies after crash

This appears to be a firmware-level or architectural issue involving power
management (SMU), GFXOFF, and PCIe bus state transitions under sustained heavy
load.

Any assistance or engineering escalation is appreciated.
I can provide full logs, more crash traces, or additional hardware data upon
request.
