Hello,

I am reporting a reproducible GPU hang on Linux 6.19-rc3 in the amdgpu MES 
control path. After a long-running compute workload, the failure escalates 
into global fence starvation across multiple GPU rings.

### Hardware/software

* Platform: AZW SER9 (Ryzen AI 9 HX 370)
* GPU: Radeon 880M / 890M (gfx1150, Strix)
* BIOS: SER9T304 (11/13/2025)
* Kernel: Linux 6.19.0-rc3 (vanilla git, no downstream patches)
* Firmware: linux-firmware current as of Dec 2025
* Desktop: GNOME Wayland
* amdgpu: MES enabled (default)

### Workload

* Sustained GPU compute workload (training loop; a rough stand-in is sketched after this list)
* Runtime before failure: ~13 hours
* No suspend/resume, power gating, or thermal events
* System otherwise idle
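
For context, a minimal stand-in for the kind of load involved is below. This 
is only a rough sketch, not the actual training job: it assumes a ROCm build 
of PyTorch (which exposes the GPU through the usual `cuda` device string), 
and the model/batch shapes are arbitrary. Any sustained compute loop that 
keeps submissions flowing for many hours should exercise the same path.

```python
# Rough stand-in for the long-running workload (NOT the actual training job).
# Assumes a ROCm build of PyTorch; the GPU shows up as the "cuda" device.
import torch

device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

while True:  # left running for many hours; the hang appeared after ~13 h
    x = torch.randn(256, 4096, device=device)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```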

### Failure sequence

After approximately 13 hours, the following sequence occurs:

1. **MES control-path failure**

```
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
```

2. **MES ring buffer saturation**

```
amdgpu: MES ring buffer is full.
```

This message repeats continuously.

3. **Global fence starvation**
   Shortly after the MES ring fills, fence fallback timers begin expiring on 
multiple rings:

```
amdgpu: Fence fallback timer expired on ring vcn_unified_0
amdgpu: Fence fallback timer expired on ring sdma0
amdgpu: Fence fallback timer expired on ring gfx_0.0.0
amdgpu: Fence fallback timer expired on ring comp_1.3.0
```

These messages persist until reboot.

### Resulting behavior

* GPU compute and graphics are effectively deadlocked
* No forward progress on any affected ring
* GNOME becomes unresponsive
* SSH and local console remain functional
* amdgpu reset does not recover the device
* Full reboot is required

### Interpretation

This appears to be a long-run livelock or resource exhaustion in the 
scheduling/control path:

* MES stops responding to control messages
* The MES ring buffer fills and never drains
* Dependent rings stop retiring fences
* Fence fallback timers expire, indicating loss of forward progress
* No successful recovery path is triggered

This does not appear to be client misuse or a transient reset issue, but rather 
a deterministic failure after prolonged operation.

I can reliably reproduce this issue and can provide full logs, ring dumps, 
kernel config, or test patches if useful.
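
If a per-stage timeline would help, I can also capture one with a small log 
watcher along these lines. This is just a sketch: it follows the kernel log 
via journalctl and notes when each of the messages quoted above first 
appears; the match strings are taken verbatim from the excerpts in this 
report.

```python
#!/usr/bin/env python3
# Sketch: follow the kernel log and record when each stage of the failure
# sequence is first seen. Match strings come from the dmesg excerpts above.
import subprocess
import time

MARKERS = {
    "mes_no_response": "MES failed to respond to msg=MISC",
    "mes_ring_full": "MES ring buffer is full",
    "fence_fallback": "Fence fallback timer expired on ring",
}
first_seen = {}

# journalctl -k -f streams kernel messages as they arrive.
proc = subprocess.Popen(
    ["journalctl", "-k", "-f", "-o", "short-iso"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    for name, marker in MARKERS.items():
        if marker in line and name not in first_seen:
            first_seen[name] = time.monotonic()
            print(f"first '{name}': {line.strip()}", flush=True)
```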

I can also report that, unlike rc2, which failed and never recovered, rc3 
recovered after about 30 minutes to the point where consoles under GNOME 
operated normally, and a systemd reboot then completed, albeit very slowly. 
This is a definite improvement.

Best regards,
Harris Landgarten
