Hello AMD GPU team,

I am reporting a reproducible issue on Strix Point (gfx1150, Radeon 890M)
where the MES scheduler (Ring 13) wedges under sustained compute load.
Once this occurs, the GPU becomes unrecoverable from userspace, and the
system becomes unable to shut down cleanly. This happens reliably during
extended machine-learning training workloads.

Hardware / software environment:
--------------------------------
• GPU: Strix [Radeon 880M / 890M], PCI ID 1002:150e  
• CPU: AMD Ryzen AI 9 HX 370  
• OS: Gentoo Linux (fully updated)  
• Kernel: 6.18.0-gentoo-x86_64  
• linux-firmware version: linux-firmware-20251125_p20251203  
• GPU IFWI: 113-STRIXEMU-001, version 00107777  
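
For completeness, the identifiers above come straight from standard tools,
roughly like this (the grep pattern is just a convenience):

    # GPU PCI ID (the [1002:150e] pair) and running kernel version
    lspci -nn | grep -Ei 'vga|display'
    uname -r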

Important notes about firmware authenticity:
--------------------------------------------
• The GPU firmware in the linux-firmware package is **unmodified AMD-
  provided firmware**.  
• Gentoo does *not* patch amdgpu firmware—if it loaded successfully, it is
  the exact AMD-signed blob.  
• All GPU firmware components load with valid signatures.  
• There is a separate Gentoo CPU “amd-ucode” signature warning, but that is
  unrelated; GPU firmware loads cleanly.  
• I am therefore already running the *exact* firmware AMD intends for
  gfx1150 (the loaded versions can be double-checked with the commands
  sketched below). If a recent microcode or kernel fix was supposed to
  address MES issues, this confirms the problem persists.
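
For reference, here is roughly how I confirm which firmware the driver
actually loaded (the debugfs path assumes DRI card index 0 and debugfs
mounted at /sys/kernel/debug; adjust as needed):

    # amdgpu firmware/ucode lines from the kernel log (run as root)
    dmesg | grep -iE 'amdgpu.*(firmware|ucode|mes)'

    # Per-block firmware versions reported by the driver, including MES
    cat /sys/kernel/debug/dri/0/amdgpu_firmware_info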

Description of the failure:
---------------------------
After many hours of sustained GPU compute (deep-learning training), the
driver begins logging Ring 13 / MES errors. Once the first real-time
scheduler message failure appears, the GPU quickly becomes unresponsive:

• amdgpu stops accepting messages
• Ring 13 reports "MES buffer full" and/or stops progressing
• the display session dies, triggering an emergency logout
• attempting to restart GDM usually causes an immediate reboot
• the system cannot complete a clean shutdown once MES is wedged
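
To catch the onset in real time, I keep the kernel log open during training;
something along these lines is enough to spot the first MES complaint (the
grep pattern is only an example, not the exact driver message):

    # Follow kernel messages during the run and highlight MES / ring errors
    journalctl -kf | grep -iE --line-buffered 'mes|ring'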

Critically, `amd-smi reset --gpureset -g 0` *almost* works: it resets GFX
momentarily and GNOME becomes responsive for roughly 30 seconds, but a
forced power-off follows shortly afterward. This suggests the MES block
remains in a corrupted state and cannot be reinitialized safely.
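
When I try that reset, I capture the surrounding kernel output and, if the
driver produces one, the device coredump; a rough sketch (the devcoredump
path is my assumption about where the kernel exposes it, and a dump may not
always be generated):

    # Attempt the GPU reset, then save the kernel log around it
    sudo amd-smi reset --gpureset -g 0
    sudo dmesg -T > dmesg-after-reset.txt

    # If a devcoredump appeared, save it before it is discarded
    ls /sys/class/devcoredump/ 2>/dev/null
    sudo cp /sys/class/devcoredump/devcd*/data gpu-devcoredump.bin 2>/dev/null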

This behavior is consistent and fully reproducible under heavy compute.

`amd-smi` diagnostic data:
--------------------------
The following `amd-smi` outputs were captured **while training was running**,
so they reflect real runtime conditions rather than idle-state reporting.
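
Snapshots like these are easy to grab periodically during a run with a small
loop along these lines (the interval and filenames are illustrative, nothing
amd-smi requires):

    # Snapshot amd-smi metrics every 60 s so the last pre-wedge state is on disk
    while true; do
        ts=$(date +%Y%m%d-%H%M%S)
        amd-smi metric -g all > "amdsmi-metric-${ts}.txt"
        sleep 60
    done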

Examples:

`amd-smi static -g all`:
  • correctly reports gfx1150, 16 CUs, 4096 MB VRAM, clocks, caches, etc.
  • IFWI package 113-STRIXEMU-001 v00107777 is present
  • firmware components (CP_PFP, CP_ME, MEC1, RLC, SDMA, VCN, ASD, PM)
    all load with valid versions

`amd-smi metric -g all`:
  • reports training-phase VRAM usage (approx. 2.4 GB of 4 GB)
  • temperatures are normal
  • most advanced metrics for this iGPU are "N/A"
  • system fully responsive before the wedge

`amd-smi list -g all`:
  • GPU 0 at BDF 0000:c5:00.0 with UUID 00ff150e-0000-1000-8000-000000000000

Impact:
-------
Once Ring 13 wedges:
• the GPU cannot process new compute work
• display output becomes unreliable
• an amdgpu reset does not recover the device
• the system cannot shut down cleanly

This appears to be a MES firmware or driver scheduling issue specific to
gfx1150 under long-duration compute loads.

What I can provide:
-------------------
• Full logs including `dmesg`, journal slices, and fence dump snapshots
  (collection commands sketched below)
• Exact reproduction steps (a 10–20 hour training run reproduces the hang
  every time)
• Willingness to test:
    – patched kernels
    – updated MES firmware
    – new linux-firmware bundles
    – debug patches or instrumentation  
• Any additional diagnostic commands you require  
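
For the record, the log bundle above is collected roughly as follows (the
fence-info debugfs path again assumes DRI card index 0):

    # Kernel log from the boot that wedged, plus the current dmesg
    journalctl -k -b -1 > journal-kernel-prev-boot.txt
    sudo dmesg -T > dmesg-current.txt

    # Fence state snapshot from amdgpu debugfs (run as root)
    cat /sys/kernel/debug/dri/0/amdgpu_fence_info > fence-info.txt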

Please let me know what additional information would be most useful.

Thank you for your time, and I’m happy to assist further with debugging
gfx1150 stability under sustained compute load.

The corresponding Gentoo bug, https://bugs.gentoo.org/967078, tracks this
issue and includes the log attachments.

Harris Landgarten
516 643-1286
