Adding folks from the KFD team to take a look. Thank you for bisecting. Does the attached patch fix it?

Thanks,

Alex

On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbr...@gmail.com> wrote:
>
> Good Afternoon and best wishes!
> This is my first attempt at upstreaming an issue after daily-driving Arch for
> a full year now :)
> Please forgive me, a lot of this is pushing my comfort zone, but preventing
> needless e-waste is important to me personally :) With this in mind, I will
> save your eyeballs and let you know I did use GPT to help compile the below,
> but I have proofread it several times (which means you can't be mad :p ).
>
>
> https://github.com/ROCm/ROCm/issues/4965
> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>
>
> Hello Kernel, AMD GPU, & ROCm maintainers,
>
> TL;DR: My Polaris (RX-580, gfx803) freezes under compute load on a number of
> kernels since v6.14 and newer. This was not previously the case prior to 6.15
> for ROCm 6.4.0 on gfx803 cards.
>
> The issue has been successfully mitigated within an older version of ROCm
> under kernel 6.16-rc2 by reverting two specific commits:
>
> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>
> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”,
> 2025-03-06)
>
> Reverting both commits on top of v6.16-rc3 restores full stability and allows
> ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper) to run.
> Instability is usually immediately obvious, e.g. models fail to initialise
> with no errors (other than in host dmesg) and no segfault reported, which is
> the usual failure mode under previous kernels.
>
> ________________________________
>
> Problem Description
>
> A number of users report GPU hangs when initialising compute loads,
> specifically with ROCm 5.7+ workloads. This issue appears to be a regression,
> as it was not present in earlier kernel versions.
>
> System Information:
>
> OS: Arch Linux
>
> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>
> GPU: AMD Radeon RX 580 Series (gfx803)
>
> ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per
> rocminfo --support)
>
> ________________________________
>
> Affected Kernels and Regression Details
>
> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>
> Last known good: v6.11
>
> First known bad: v6.12
>
> The regression has been bisected to the following two commits, as reverting
> them resolves the issue:
>
> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>
> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”, 2025-03-06)
>
> Both patches touch amdkfd queue reset paths and are first included in the
> exact releases where the regression appears.
>
> Here's a summary of kernel results:
>
> Kernel                              | Result | Note
> ----------------------------------- | ------ | ----
> 6.13.y (LTS)                        | OK     |
> 6.14.0                              | OK     | Baseline: my last working kernel, though I am not sure of the exact point release
> 6.14.1-rc1                          | BAD    | First hang
> 6.15-rc1                            | BAD    | Hang
> 6.15.8                              | BAD    | Hang
> 6.16-rc3                            | BAD    | Hang
> 6.16-rc3 – revert de84484 + bac38ca | OK     | Full stability restored, ROCm workloads run for hours.
>
> ________________________________
>
> Reproduction Steps
>
> Boot the system with a kernel version exhibiting the issue (e.g., v6.14.1-rc1
> or newer without the reverts).
>
> Run a ROCm workload that creates several compute queues, for example:
>
> python stable-diffusion.py
>
> faster-whisper --model medium ...
>
> Upon model initialization, an immediate driver crash occurs. This is visible
> on the host machine via dmesg logs.
>
> Observed Error Messages (dmesg):
>
> [drm] scheduler comp_1.1.1 is not ready, skipping
> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
> [message continues ad infinitum while system functions generally]
>
> This is followed by a hard GPU reset (visible in logs, no visual artifacts),
> which reliably leads to a full system lockup. Python or Docker processes
> become unkillable, requiring a manual reboot. Over time, the desktop slowly
> loses interactivity.
>
> ________________________________
>
> Bisect Details
>
> I previously attempted a git bisect (limited to drivers/gpu/drm/amd) between
> v6.12 and v6.15-rc1, which identified some further potentially problematic
> commits; however, due to an undersized /boot partition, I was experiencing
> some difficulties. In the interim, it seems a user on the gfx803
> compatibility repo discovered the below regarding ROCm 5.7:
>
> de84484c6f8b07ad0850d6c4 bad
> bac38ca057fef2c8c024fe9e bad
>
> Cherry-picking reverts of both commits on top of v6.16-rc3 restores normal
> behavior; leaving either patch in place reproduces the hang.
>
> ________________________________
>
> Relevant Log Excerpts
>
> (Full dmesg logs can be attached separately if needed)
>
> [drm] scheduler comp_1.1.1 is not ready, skipping
> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
>
> ________________________________
> References:
>
> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping ...
> (https://bbs.archlinux.org/viewtopic.php?id=302729)
>
> Observations about HSA and KFD backends in TinyGrad · GitHub
> (https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>
> AMD RX580 system freeze on maximum VRAM speed
> (https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639)
>
> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1
> (https://lkml.org/lkml/2025/4/5/394)
>
> Commits · torvalds/linux - GitHub (Link for commit de84484)
> (https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335)
>
> Commits · torvalds/linux - GitHub (Link for commit bac38ca)
> (https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980)
>
> ROCm-For-RX580/README.md at main - GitHub
> (https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>
> ROCm 4.6.0 for gfx803 - GitHub
> (https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779)
>
> Compatibility matrices - Use ROCm on Radeon GPUs - AMD
> (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html)
>
>
> ________________________________
>
> Why this matters
>
> Although gfx803 is End-of-Life (EOL) for official ROCm support, large user
> communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it.
> Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/) demonstrate
> that ROCm 6.4+ and RX-580 are fully functional on a number of relatively
> recent kernels. This regression significantly impacts the usability of these
> cards for compute workloads.
>
> ________________________________
>
> Proposed Next Steps
>
> I suggest the following for further investigation:
>
> Review the interaction between the new KFD signal-event slow-path and legacy
> GPUs that may lack valid event IDs.
>
> Confirm whether hqd_sdma_get_doorbell() logic (added in bac38ca) returns
> stale doorbells on gfx803, potentially causing false positives.
>
> Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is developed.
>
> Please let me know if you require any further diagnostics or testing. I can
> easily rebuild kernels and provide annotated traces.
>
> Please find my working document:
> https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>
> Thanks for your time!
>
> Best regards, big love,
>
> Johl Brown
>
> johlbr...@gmail.com

From 3012bbbb378083c2af3433eedb9c2c24cbe8395a Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deuc...@amd.com>
Date: Wed, 25 Jun 2025 18:15:37 -0400
Subject: [PATCH] drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8

These were missed when support was added for other generations.
The callbacks are called unconditionally so we need to make
sure all generations have them.

Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+")
Cc: Jonathan Kim <jonathan....@amd.com>
Reported-by: Johl Brown <johlbr...@gmail.com>
Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c | 8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c | 8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c
index ca4a6b82817f5..df77558e03ef2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c
@@ -561,6 +561,13 @@ static uint32_t read_vmid_from_vmfault_reg(struct amdgpu_device *adev)
 	return REG_GET_FIELD(status, VM_CONTEXT1_PROTECTION_FAULT_STATUS, VMID);
 }
 
+static uint32_t kgd_hqd_sdma_get_doorbell(struct amdgpu_device *adev,
+					  int engine, int queue)
+
+{
+	return 0;
+}
+
 const struct kfd2kgd_calls gfx_v7_kfd2kgd = {
 	.program_sh_mem_settings = kgd_program_sh_mem_settings,
 	.set_pasid_vmid_mapping = kgd_set_pasid_vmid_mapping,
@@ -578,4 +585,5 @@ const struct kfd2kgd_calls gfx_v7_kfd2kgd = {
 	.set_scratch_backing_va = set_scratch_backing_va,
 	.set_vm_context_page_table_base = set_vm_context_page_table_base,
 	.read_vmid_from_vmfault_reg = read_vmid_from_vmfault_reg,
+	.hqd_sdma_get_doorbell = kgd_hqd_sdma_get_doorbell,
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c
index 0f3e2944edd7e..e68c0fa8d7513 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c
@@ -582,6 +582,13 @@ static void set_vm_context_page_table_base(struct amdgpu_device *adev,
 		lower_32_bits(page_table_base));
 }
 
+static uint32_t kgd_hqd_sdma_get_doorbell(struct amdgpu_device *adev,
+					  int engine, int queue)
+
+{
+	return 0;
+}
+
 const struct kfd2kgd_calls gfx_v8_kfd2kgd = {
 	.program_sh_mem_settings = kgd_program_sh_mem_settings,
 	.set_pasid_vmid_mapping = kgd_set_pasid_vmid_mapping,
@@ -599,4 +606,5 @@ const struct kfd2kgd_calls gfx_v8_kfd2kgd = {
 		get_atc_vmid_pasid_mapping_info,
 	.set_scratch_backing_va = set_scratch_backing_va,
 	.set_vm_context_page_table_base = set_vm_context_page_table_base,
+	.hqd_sdma_get_doorbell = kgd_hqd_sdma_get_doorbell,
 };
-- 
2.50.0
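
For anyone following along, the commit message's point that the callbacks are called unconditionally can be illustrated outside the kernel. What follows is a minimal user-space sketch, not kernel code. struct demo_ops and every demo_* name are hypothetical stand-ins for struct kfd2kgd_calls and its callers, and the callback signature is simplified (no amdgpu_device argument). The sketch assumes the caller dispatches through the function pointer without a NULL check, as the commit message describes: a table that omits the entry would then be called through a NULL pointer, while a stub returning 0 keeps the call safe.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for a per-generation callback table.
 * It only mimics the shape of struct kfd2kgd_calls; it is not the kernel's
 * definition, and the callback here drops the amdgpu_device argument.
 */
struct demo_ops {
	uint32_t (*hqd_sdma_get_doorbell)(int engine, int queue);
};

/* Stub in the spirit of the patch: this generation reports doorbell 0. */
static uint32_t demo_stub_get_doorbell(int engine, int queue)
{
	(void)engine;
	(void)queue;
	return 0;
}

/* A table that omits the callback, so the pointer stays NULL. */
static const struct demo_ops demo_ops_missing = {
	.hqd_sdma_get_doorbell = NULL,
};

/* A table that fills the slot with the stub. */
static const struct demo_ops demo_ops_stubbed = {
	.hqd_sdma_get_doorbell = demo_stub_get_doorbell,
};

/*
 * Unconditional dispatch, as the commit message describes the caller.
 * Passing demo_ops_missing here would call through a NULL pointer.
 */
static uint32_t demo_call_unconditional(const struct demo_ops *ops,
					int engine, int queue)
{
	return ops->hqd_sdma_get_doorbell(engine, queue);
}

/*
 * Guarded dispatch, shown only for contrast: returns -1 and leaves *out
 * untouched when the callback is absent, 0 on success.
 */
static int demo_call_guarded(const struct demo_ops *ops, int engine,
			     int queue, uint32_t *out)
{
	if (!ops->hqd_sdma_get_doorbell)
		return -1;
	*out = ops->hqd_sdma_get_doorbell(engine, queue);
	return 0;
}
```

In this sketch, demo_call_unconditional(&demo_ops_stubbed, 0, 0) returns 0, whereas the same call with &demo_ops_missing would crash, so the missing table is only safe to pass to the guarded variant. The patch above takes the stub route for gfx7/8 rather than adding a guard at the call site.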