https://bugzilla.kernel.org/show_bug.cgi?id=220813
Bug ID: 220813
Summary: 3x Radeon RX 7900 XTX Cards Exhibiting Identical PCIe
Bus Dropouts, SMU/GFXOFF Failures, and Full GPU Loss
Under Gaming and Compute Loads
Product: Drivers
Version: 2.5
Hardware: All
OS: Linux
Status: NEW
Severity: high
Priority: P3
Component: Video(DRI - non Intel)
Assignee: [email protected]
Reporter: [email protected]
Regression: No
Created attachment 308979
--> https://bugzilla.kernel.org/attachment.cgi?id=308979&action=edit
Logs captured on Ubuntu in both Gaming and AI workloads demonstrating the
failure pattern
I am reporting a reproducible and severe failure mode affecting three separate
AMD Radeon RX 7900 XTX GPUs across two years, two operating systems, multiple
workloads, and multiple driver stacks. All three cards eventually degrade into
the same failure condition: full GPU disappearance from the PCIe bus, SMU
lock-ups, GFX ring timeouts, SDMA failures, fans ramping to 100%, and permanent
instability until AC power is removed.
This appears to be a systemic hardware or firmware-level issue, not isolated to
one card or one environment.
Hardware and System Information
GPU 1: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: A6V070402
GPU 2: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: ACB000640
GPU 3: XFX Radeon RX 7900 XTX Merc 310 Black Edition
Serial Number: Y6V013609 (Original first owned card, eventually died, no
display signal, has been RMA'd)
Motherboard: ASUS Prime Pro X570
BIOS: 5031
PSU: Cooler Master MWE Gold 1050 V2 (3 independent PCIe power runs)
OS: Ubuntu 24.04.3 (Noble Numbat)
Kernel: 6.14.0-36-generic
Drivers: Upstream amdgpu (KMS enabled)
Control GPU Testing (Stable GPUs in Same System)
To exclude PSU, motherboard, RAM, OS, kernel, driver, BIOS, and workload
factors, I tested two separate control GPUs in the same hardware:
Control GPU A: PowerColor Radeon RX 7800 XT
Control GPU B: Sapphire Nitro+ Radeon RX 7800 XT
Both 7800 XT cards were tested under:
The same llama.cpp torture workloads
The same gaming workloads (Superposition, Battlefield 2042, Helldivers 2,
Battlefield 6)
The same system, same PSU cables, same PCIe slot, same BIOS config
The same Ubuntu installation and same Windows installation
Result:
Both 7800 XT cards are completely stable.
No crashes, no artefacts, no PCIe dropouts, no SMU/GFXOFF failures.
This confirms the issue is specific to the 7900 XTX GPU family in this
configuration, not the system or software.
The 7900 XTX failures also occur under Windows 11, fully updated with the
latest Adrenalin drivers.
Crash Behaviour Description
All affected GPUs exhibit the same failure mode:
Display output instantly cuts to black
GPU fans ramp to 100%
The operating system continues running in background
amdgpu reports:
GFX ring timeouts
SDMA ring timeouts
SMU failure to respond
GFXOFF cannot be disabled
The GPU vanishes from the PCIe bus
Driver reset attempts fail with the error:
"GPU reset begin → device lost from bus → GPU Recovery Failed: -19"
A full AC power removal is required to restore partial functionality
After the crash, the GPU often displays visual artefacts and instability even
after reboot
This post-crash corruption persists until PSU is switched off and the
motherboard fully discharges
In some cases, the system will not power on unless the PSU switch is in the O
(off) position.
These symptoms are identical on all three cards.
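To make the "GPU vanishes from the PCIe bus" symptom checkable from a script, here is a minimal sketch that lists AMD display controllers still enumerated in sysfs. It assumes the standard Linux layout under /sys/bus/pci/devices; the function name and the sysfs_root parameter are illustrative, not part of any tool. A device that has fallen off the bus may keep a stale sysfs entry until a rescan, so an empty result indicates loss but a non-empty one does not prove the card is healthy.

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD/ATI


def amd_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses of AMD display controllers listed in sysfs.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default is what you want.
    """
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            # Entry disappeared or is unreadable; skip it.
            continue
        # PCI class codes beginning with 0x03 are display controllers.
        if vendor == AMD_VENDOR_ID and pci_class.startswith("0x03"):
            found.append(dev.name)
    return found
```

Calling amd_display_devices() before a stress run and again after a black-screen event gives a quick before/after comparison that can be logged alongside the kernel messages.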
Linux Crash Reproduction (Compute Workload)
The failure is highly reproducible with llama.cpp under full GPU load.
Configuration:
ngl = 999
Context = 4096
Threads = 8
Model: Qwen2.5-7B q4_k_m
Continuous prompt generation in an infinite loop
Approximate crash times:
First crash: 4 hours
Second crash: 1–2 hours
Crashes are guaranteed within a few hours
Glmark2 completes without crashing, suggesting the issue is related to
sustained high-load VRAM/SDMA/GFX conditions, not basic rendering.
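For anyone trying to reproduce this, the llama.cpp configuration above can be sketched as a small driver script. This is not the exact harness I used; the binary path, model filename, prompt, and n_predict value are placeholders, and only the ngl, context, and thread settings come from this report. Recent llama.cpp builds may start an interactive chat session by default and may need that disabled for unattended looping.

```python
import subprocess

# Placeholder paths; adjust to the local llama.cpp build and model file.
LLAMA_CLI = "./llama-cli"
MODEL = "qwen2.5-7b-q4_k_m.gguf"


def build_command(prompt, n_predict=512):
    """Build a llama-cli invocation using the settings from this report."""
    return [
        LLAMA_CLI,
        "-m", MODEL,
        "-ngl", "999",         # offload all layers to the GPU
        "-c", "4096",          # context size
        "-t", "8",             # CPU threads
        "-n", str(n_predict),  # tokens to generate per run (assumed value)
        "-p", prompt,
    ]


def run_until_failure(prompt="Write a long essay about PCIe."):
    """Re-run generation until llama-cli exits non-zero; return the run count."""
    iteration = 0
    while True:
        iteration += 1
        result = subprocess.run(build_command(prompt))
        if result.returncode != 0:
            print(f"run {iteration}: llama-cli exited with {result.returncode}")
            return iteration
```

Calling run_until_failure() starts the loop. Logging a timestamp per run would make it possible to report time-to-crash more precisely than the approximate figures given above.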
Windows Crash Reproduction (Gaming Workloads)
The failure also occurs in normal gaming scenarios on both Linux and Windows.
Games and benchmarks that trigger identical failure:
Unigine Superposition (4K Optimized or higher)
Battlefield 2042
Battlefield 6
Dying Light 2
Helldivers 2
The Windows symptoms are identical:
Black screen
Fans to 100%
GPU disappears until full PSU power cycle
Event Viewer shows GPU timeout attempts and reset failures
Post-crash artefacting on secondary display
This eliminates OS-level or application-specific causes.
Key Kernel Log Excerpts
SMU/GFXOFF failures repeatedly logged:
amdgpu: SMU: response:0xFFFFFFFF for index:41
amdgpu: Failed to disable gfxoff
Ring timeouts:
amdgpu: ring gfx timeout
amdgpu: ring sdma1 timeout
Reset failures:
amdgpu: MES failed to respond to msg=RESET
amdgpu: GPU reset begin!
amdgpu: device lost from bus!
amdgpu: GPU reset end with ret = -19
amdgpu: GPU Recovery Failed: -19
These cascades happen repeatedly until the GPU fully disappears.
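The cascade above can be counted mechanically. The following is a minimal sketch that tallies the failure signatures quoted in this report from a stream of kernel log lines. The category names are my own labels, and the patterns are substrings of the excerpts above; real log lines carry timestamps and PCI addresses, which is why matching is done by search rather than on whole lines.

```python
import re
from collections import Counter

# Patterns are substrings of the log excerpts in this report.
# The dictionary keys are illustrative labels, not amdgpu terminology.
SIGNATURES = {
    "smu_no_response": re.compile(r"SMU: response:0xFFFFFFFF"),
    "gfxoff_failure": re.compile(r"Failed to disable gfxoff"),
    "ring_timeout": re.compile(r"ring \S+ timeout"),
    "mes_no_response": re.compile(r"MES failed to respond"),
    "device_lost": re.compile(r"device lost from bus"),
    "recovery_failed": re.compile(r"GPU Recovery Failed: -?\d+"),
}


def classify(lines):
    """Count how many amdgpu log lines match each failure signature."""
    counts = Counter()
    for line in lines:
        if "amdgpu" not in line:
            continue
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "amdgpu: SMU: response:0xFFFFFFFF for index:41",
    "amdgpu: Failed to disable gfxoff",
    "amdgpu: ring gfx timeout",
    "amdgpu: ring sdma1 timeout",
    "amdgpu: MES failed to respond to msg=RESET",
    "amdgpu: GPU reset begin!",
    "amdgpu: device lost from bus!",
    "amdgpu: GPU reset end with ret = -19",
    "amdgpu: GPU Recovery Failed: -19",
]
summary = classify(sample)
```

Feeding it the kernel messages from the previous boot (for example the output of journalctl -k -b -1) gives a per-crash summary that is easier to compare across the three cards than raw logs.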
Post-Crash Artefacts and PCIe Behaviour
Persistent visual corruption after reboot
Corruption persists until full AC removal
Removing secondary displays does not eliminate artefacts
PCIe bus occasionally behaves unpredictably
System attempts power-on with PSU switch OFF (reported on two GPUs)
Conclusion
Across three separate 7900 XTX GPUs, from different batches and time periods,
the same catastrophic failure mode occurs. This failure manifests identically
across Linux and Windows, gaming and compute workloads, and even after multiple
fresh OS installations.
The issue involves:
GFX/SDMA ring instability
SMU nonresponsiveness
GFXOFF control failure
Complete PCIe device disappearance
Failure of the amdgpu driver to reset the card
Post-crash persistent artefacts
Power state anomalies after crash
This appears to be a firmware-level or architectural issue involving power
management (SMU), GFXOFF, and PCIe bus state transitions under sustained heavy
load.
Any assistance or engineering escalation is appreciated.
I can provide full logs, more crash traces, or additional hardware data upon
request.