Hello, I am reporting a reproducible GPU hang on Linux 6.19-rc3 involving the amdgpu MES control path which, after a long-running compute workload, escalates into global fence starvation across multiple GPU rings.
### Hardware/software

* Platform: AZW SER9 (Ryzen AI 9 HX 370)
* GPU: Radeon 880M / 890M (gfx1150, Strix)
* BIOS: SER9T304 (11/13/2025)
* Kernel: Linux 6.19.0-rc3 (vanilla git, no downstream patches)
* Firmware: linux-firmware current as of Dec 2025
* Desktop: GNOME Wayland
* amdgpu: MES enabled (default)

### Workload

* Sustained GPU compute workload (training loop)
* Runtime before failure: ~13 hours
* No suspend/resume, power gating, or thermal events
* System otherwise idle

### Failure sequence

After approximately 13 hours, the following sequence occurs:

1. **MES control-path failure**

```
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
```

2. **MES ring buffer saturation**

```
amdgpu: MES ring buffer is full.
```

This message repeats continuously.

3. **Global fence starvation**

Shortly after the MES ring fills, fence fallback timers begin expiring on multiple rings:

```
amdgpu: Fence fallback timer expired on ring vcn_unified_0
amdgpu: Fence fallback timer expired on ring sdma0
amdgpu: Fence fallback timer expired on ring gfx_0.0.0
amdgpu: Fence fallback timer expired on ring comp_1.3.0
```

These messages persist until reboot.

### Resulting behavior

* GPU compute and graphics are effectively deadlocked
* No forward progress on any affected ring
* GNOME becomes unresponsive
* SSH and the local console remain functional
* amdgpu reset does not recover the device
* A full reboot is required

### Interpretation

This appears to be a long-run livelock or resource exhaustion in the scheduling/control path:

* MES stops responding to control messages
* The MES ring buffer fills and never drains
* Dependent rings stop retiring fences
* Fence fallback timers expire, indicating loss of forward progress
* No successful recovery path is triggered

This does not appear to be client misuse or a transient reset issue, but rather a deterministic failure after prolonged operation. I can reliably reproduce it and can provide full logs, ring dumps, kernel config, or test patches if useful.

I can also report that, unlike rc2, which failed and never recovered, rc3 recovered after about 30 minutes to the point where consoles under GNOME operated normally, and a systemd reboot completed, albeit very slowly. This is a definite improvement.

Best regards,
Harris Landgarten
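
P.S. For anyone attempting to reproduce, the workload is a generic sustained training loop rather than anything exotic. The sketch below is a minimal, hypothetical stand-in (assuming PyTorch built against ROCm, which exposes the iGPU as a `cuda` device), not the actual job; any comparable long-running compute submission stream should exercise the same MES path.

```python
# Hypothetical reproducer sketch (not the actual workload): a sustained
# training-style loop that keeps compute submissions flowing for many hours.
# Assumes PyTorch on ROCm, where the GPU appears as a "cuda" device.
import torch

device = torch.device("cuda")                     # gfx1150 via ROCm
model = torch.nn.Linear(4096, 4096).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

while True:                                       # failure observed after ~13 hours
    x = torch.randn(256, 4096, device=device)
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```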
