Good afternoon and best wishes! This is my first attempt at upstreaming an issue after daily-driving Arch for a full year now :) Please forgive me, a lot of this is pushing my comfort zone, but preventing needless e-waste is important to me personally :) With this in mind, I will save your eyeballs and let you know I did use GPT to help compile the below, but I have proofread it several times (which means you can't be mad :p ).
https://github.com/ROCm/ROCm/issues/4965
https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779

Hello Kernel, AMD GPU, & ROCm maintainers,

TL;DR: My Polaris card (RX-580, gfx803) freezes under compute load on a number of kernels from v6.14 onward. Earlier kernels ran ROCm 6.4.0 on gfx803 cards without this problem. The issue has been mitigated with an older ROCm release by reverting two specific commits:

- de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
- bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”, 2025-03-06)

Reverting both commits on top of v6.16-rc3 restores full stability and allows ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper) to run.

The instability is usually obvious immediately: models fail to initialise, and nothing is reported apart from host dmesg output, whereas under previous kernels a failure would normally show up as an error or a segfault.

------------------------------
Problem Description

A number of users report GPU hangs when initialising compute loads, specifically with ROCm 5.7+ workloads. This issue appears to be a regression, as it was not present in earlier kernel versions.

System Information:

- OS: Arch Linux
- CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
- GPU: AMD Radeon RX 580 Series (gfx803)
- ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per rocminfo --support)

------------------------------
Affected Kernels and Regression Details

The problem consistently occurs on v6.14.1-rc1 and newer kernels.
- Last known good: v6.14.0
- First known bad: v6.14.1-rc1

The regression has been narrowed down to the following two commits, as reverting them resolves the issue:

- de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
- bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”, 2025-03-06)

Both patches touch amdkfd queue reset paths and are first included in the exact releases where the regression appears.

Here's a summary of kernel results:

Kernel | Result | Note
------- | -------- | --------
6.13.y (LTS) | OK |
6.14.0 | OK | Baseline: my last working kernel, though I am not sure of the exact sub-version
6.14.1-rc1 | BAD | First hang
6.15-rc1 | BAD | Hang
6.15.8 | BAD | Hang
6.16-rc3 | BAD | Hang
6.16-rc3 with de84484 + bac38ca reverted | OK | Full stability restored, ROCm workloads run for hours

------------------------------
Reproduction Steps

1. Boot the system with a kernel version exhibiting the issue (e.g., v6.14.1-rc1 or newer without the reverts).
2. Run a ROCm workload that creates several compute queues, for example:
   - python stable-diffusion.py
   - faster-whisper --model medium ...
3. Upon model initialization, an immediate driver crash occurs. This is visible on the host machine via dmesg logs.

Observed Error Messages (dmesg):

[drm] scheduler comp_1.1.1 is not ready, skipping
[drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
[messages repeat indefinitely while the system otherwise keeps running]

This is followed by a hard GPU reset (visible in logs, no visual artifacts), which reliably leads to a full system lockup. Python or Docker processes become unkillable, requiring a manual reboot. Over time, the desktop slowly loses interactivity.

------------------------------
Bisect Details

I previously attempted a git bisect (limited to drivers/gpu/drm/amd) between v6.12 and v6.15-rc1, which identified some further potentially problematic commits; however, an undersized /boot partition caused me some difficulties.
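For anyone repeating the bisect, each step needs a good/bad verdict after a workload has been run. Below is a minimal sketch, assuming the two dmesg messages quoted under Observed Error Messages are a reliable signature of the hang. It scans captured dmesg text for them. The function names and regular expressions are illustrative assumptions and are not part of any kernel or ROCm tool.

```python
import re

# Patterns derived from the dmesg messages quoted in this report.
# Treating them as a reliable hang signature is an assumption.
RING_TIMEOUT = re.compile(r"ring (comp_\d+\.\d+\.\d+) timeout")
SCHED_NOT_READY = re.compile(r"scheduler (comp_\d+\.\d+\.\d+) is not ready")


def find_hang_signature(dmesg_text):
    """Return the sorted compute ring names that show a timeout or not-ready message."""
    rings = set()
    for line in dmesg_text.splitlines():
        for pattern in (RING_TIMEOUT, SCHED_NOT_READY):
            match = pattern.search(line)
            if match:
                rings.add(match.group(1))
    return sorted(rings)


def bisect_verdict(dmesg_text):
    """Map a dmesg capture to the word used with `git bisect good` / `git bisect bad`."""
    return "bad" if find_hang_signature(dmesg_text) else "good"
```

A clean capture only means the signature was absent from that text, so the verdict is meaningful only if a compute workload was actually started before dmesg was saved.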
In the interim, it seems a user on the gfx803 compatibility repo <https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779> discovered the following regarding ROCm 5.7:

de84484c6f8b07ad0850d6c4 bad
bac38ca057fef2c8c024fe9e bad

Cherry-picking reverts of both commits on top of v6.16-rc3 restores normal behavior; leaving either patch in place reproduces the hang.

------------------------------
Relevant Log Excerpts

(Full dmesg logs can be attached separately if needed)

[drm] scheduler comp_1.1.1 is not ready, skipping
[ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
[ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms

------------------------------
References:

- It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping ... ( https://bbs.archlinux.org/viewtopic.php?id=302729 )
- Observations about HSA and KFD backends in TinyGrad · GitHub ( https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48 )
- AMD RX580 system freeze on maximum VRAM speed ( https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639 )
- LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1 ( https://lkml.org/lkml/2025/4/5/394 )
- Commits · torvalds/linux - GitHub (Link for commit de84484) ( https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335 )
- Commits · torvalds/linux - GitHub (Link for commit bac38ca) ( https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980 )
- ROCm-For-RX580/README.md at main - GitHub ( https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md )
- ROCm 4.6.0 for gfx803 - GitHub ( https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779 )
- Compatibility matrices - Use ROCm on Radeon GPUs - AMD ( https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html )

------------------------------
Why this matters

Although gfx803 is End-of-Life (EOL) for official ROCm support, large user communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it. Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/) demonstrate that ROCm 6.4+ and the RX-580 are fully functional on a number of relatively recent kernels. This regression significantly impacts the usability of these cards for compute workloads.

------------------------------
Proposed Next Steps

I suggest the following for further investigation:

- Review the interaction between the new KFD signal-event slow path and legacy GPUs that may lack valid event IDs.
- Confirm whether the hqd_sdma_get_doorbell() logic (added in bac38ca) returns stale doorbells on gfx803, potentially causing false positives.
- Consider backing out both commits for 6.15-stable / 6.16-rc while a proper fix is developed.

Please let me know if you require any further diagnostics or testing. I can easily rebuild kernels and provide annotated traces.

Please find my working document: https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066

Thanks for your time!

Best regards, big love,
Johl Brown
johlbr...@gmail.com