I couldn't find a dmesg attached to the linked bug reports. I was going to look for a kernel oops from calling an uninitialized function pointer. Your patch addresses just that.
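
To illustrate what that failure mode looks like and the shape of the usual guard, here is a minimal, self-contained userspace sketch; the ops struct and all names in it are hypothetical, not the actual amdkfd code or the attached patch:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-ASIC ops table; older ASICs may leave optional
 * callbacks unset, which mirrors the suspected oops on gfx803. */
struct asic_ops {
	int (*get_doorbell)(int queue_id);	/* optional, may be NULL */
};

/* gfx8-style table: the optional callback was never wired up. */
static const struct asic_ops gfx8_ops = {
	.get_doorbell = NULL,
};

static int read_doorbell(const struct asic_ops *ops, int queue_id)
{
	/* Checking the pointer before the call is the usual fix: without
	 * this guard the call dereferences a NULL/uninitialized pointer,
	 * which oopses in the kernel (and segfaults here in userspace). */
	if (!ops->get_doorbell)
		return -1;	/* feature not supported on this ASIC */
	return ops->get_doorbell(queue_id);
}

int main(void)
{
	printf("doorbell: %d\n", read_doorbell(&gfx8_ops, 0));
	return 0;
}

The same pattern applies in the driver: an optional per-ASIC callback that is left NULL has to be checked before it is invoked.
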
I'm not sure how “drm/amdkfd: Improve signal event slow path” is implicated. I don't see anything in that patch that would break specifically on gfx803.

Regards,
  Felix

On 2025-06-25 18:21, Alex Deucher wrote:
> Adding folks from the KFD team to take a look. Thank you for bisecting.
> Does the attached patch fix it?
>
> Thanks,
>
> Alex
>
> On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbr...@gmail.com> wrote:
>> Good afternoon and best wishes!
>> This is my first attempt at upstreaming an issue after daily-driving Arch for a full year now :)
>> Please forgive me; a lot of this is outside my comfort zone, but preventing needless e-waste is important to me personally :) With this in mind, I will save your eyeballs and let you know that I did use GPT to help compile the report below, but I have proofread it several times (which means you can't be mad :p ).
>>
>> https://github.com/ROCm/ROCm/issues/4965
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>>
>> Hello kernel, AMD GPU, and ROCm maintainers,
>>
>> TL;DR: My Polaris card (RX 580, gfx803) freezes under compute load on kernels from v6.14.1-rc1 onwards. This was not the case for ROCm 6.4.0 on gfx803 cards prior to 6.15.
>>
>> The issue has been successfully mitigated with an older ROCm release under kernel 6.16-rc2 by reverting two specific commits:
>>
>> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>>
>> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”, 2025-03-06)
>>
>> Reverting both commits on top of v6.16-rc3 restores full stability and allows ROCm 5.7 workloads (e.g. Stable Diffusion, faster-whisper) to run. The instability is usually immediately obvious: models fail to initialise with no error or segfault reported to the application (only in the host dmesg), whereas a plain segfault was the usual failure mode under previous kernels.
>>
>> ________________________________
>>
>> Problem Description
>>
>> A number of users report GPU hangs when initialising compute loads, specifically with ROCm 5.7+ workloads. This appears to be a regression, as it was not present in earlier kernel versions.
>>
>> System Information:
>>
>> OS: Arch Linux
>> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>> GPU: AMD Radeon RX 580 Series (gfx803)
>> ROCm Version: Runtime Version 1.1, Runtime Ext Version 1.7 (as per rocminfo --support)
>>
>> ________________________________
>>
>> Affected Kernels and Regression Details
>>
>> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>>
>> Last known good: v6.11
>> First known bad: v6.12
>>
>> The regression has been narrowed down to the following two commits, as reverting them resolves the issue:
>>
>> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”, 2025-03-06)
>>
>> Both patches touch amdkfd queue reset paths and are first included in the exact releases where the regression appears.
>>
>> Here's a summary of kernel results:
>>
>> Kernel                                   | Result | Note
>> ---------------------------------------- | ------ | --------
>> 6.13.y (LTS)                             | OK     |
>> 6.14.0                                   | OK     | Baseline - my last working kernel, though I am not sure of the exact point release
>> 6.14.1-rc1                               | BAD    | First hang
>> 6.15-rc1                                 | BAD    | Hang
>> 6.15.8                                   | BAD    | Hang
>> 6.16-rc3                                 | BAD    | Hang
>> 6.16-rc3 with de84484 + bac38ca reverted | OK     | Full stability restored, ROCm workloads run for hours.
>>
>> ________________________________
>>
>> Reproduction Steps
>>
>> Boot the system with a kernel version exhibiting the issue (e.g. v6.14.1-rc1 or newer, without the reverts).
>>
>> Run a ROCm workload that creates several compute queues, for example:
>>
>> python stable-diffusion.py
>> faster-whisper --model medium ...
>>
>> Upon model initialization, an immediate driver crash occurs. This is visible on the host machine via dmesg.
>>
>> Observed Error Messages (dmesg):
>>
>> [drm] scheduler comp_1.1.1 is not ready, skipping
>> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
>> [message continues ad infinitum while the system otherwise keeps functioning]
>>
>> This is followed by a hard GPU reset (visible in the logs, no visual artifacts), which reliably leads to a full system lockup. Python or Docker processes become unkillable, requiring a manual reboot. Over time, the desktop slowly loses interactivity.
>>
>> ________________________________
>>
>> Bisect Details
>>
>> I previously attempted a git bisect (limited to drivers/gpu/drm/amd) between v6.12 and v6.15-rc1, which identified some further potentially problematic commits; however, I ran into difficulties due to an undersized /boot partition. In the interim, a user on the gfx803 compatibility repo discovered the following regarding ROCm 5.7:
>>
>> de84484c6f8b07ad0850d6c4 bad
>> bac38ca057fef2c8c024fe9e bad
>>
>> Cherry-picking reverts of both commits on top of v6.16-rc3 restores normal behavior; leaving either patch in place reproduces the hang.
>>
>> ________________________________
>>
>> Relevant Log Excerpts
>>
>> (Full dmesg logs can be attached separately if needed)
>>
>> [drm] scheduler comp_1.1.1 is not ready, skipping
>> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
>> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
>>
>> ________________________________
>>
>> References:
>>
>> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping ...
>> (https://bbs.archlinux.org/viewtopic.php?id=302729)
>>
>> Observations about HSA and KFD backends in TinyGrad
>> (https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>>
>> AMD RX580 system freeze on maximum VRAM speed
>> (https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639)
>>
>> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1
>> (https://lkml.org/lkml/2025/4/5/394)
>>
>> Commits · torvalds/linux (link for commit de84484)
>> (https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335)
>>
>> Commits · torvalds/linux (link for commit bac38ca)
>> (https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980)
>>
>> ROCm-For-RX580/README.md
>> (https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>>
>> ROCm 4.6.0 for gfx803
>> (https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779)
>>
>> Compatibility matrices — Use ROCm on Radeon GPUs
>> (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html)
>>
>> ________________________________
>>
>> Why this matters
>>
>> Although gfx803 is end-of-life (EOL) for official ROCm support, large user communities (Stable Diffusion, Whisper, tinygrad) still depend on it.
>> Community builds (e.g. github.com/robertrosenbusch/gfx803_rocm/) demonstrate that ROCm 6.4+ and the RX 580 are fully functional on a number of relatively recent kernels. This regression significantly impacts the usability of these cards for compute workloads.
>>
>> ________________________________
>>
>> Proposed Next Steps
>>
>> I suggest the following for further investigation:
>>
>> Review the interaction between the new KFD signal-event slow path and legacy GPUs that may lack valid event IDs.
>>
>> Confirm whether the hqd_sdma_get_doorbell() logic (added in bac38ca) returns stale doorbells on gfx803, potentially causing false positives.
>>
>> Consider backing the patches out of 6.15-stable / 6.16-rc while a proper fix is developed.
>>
>> Please let me know if you require any further diagnostics or testing. I can easily rebuild kernels and provide annotated traces.
>>
>> Please find my working document: https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>>
>> Thanks for your time!
>>
>> Best regards, big love,
>>
>> Johl Brown
>>
>> johlbr...@gmail.com