amdgpu: Improve retry fault handling

Timur Kristóf Wed, 13 May 2026 10:51:57 -0700

Hi Amir,

Thanks for the quick response!
See my replies below.


On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia, 
> 
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen
 
> Quick highlights from my work:
> 
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO
> (WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).

I agree. That's my conclusion as well and that's exactly what I'm doing in my 
series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"

> "fault never resolves" is exactly the symptom you'd see if the
> CAM never gets cleared. 

Not exactly.

When the CAM never gets cleared, the first page fault is still resolved, but 
subsequent page faults that belong to the same CAM entry cause a hang: the 
IRQ is filtered out, so the IRQ handler is never called.

That's not what I see on Navi 48. Instead, the IRQ fires repeatedly and 
amdgpu_vm_handle_fault() is called repeatedly, but the fault is never 
resolved.
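
To make the difference between the two symptoms concrete, here is a toy 
model of the filtering behaviour described above. It is only a sketch under 
stated assumptions: the slot count, the match key and all toy_* names are 
hypothetical and do not come from the driver or the hardware documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAM_SLOTS 8 /* toy size, not the real hardware capacity */

/* Hypothetical model of a retry CAM: one entry per outstanding fault. */
struct toy_cam {
	bool     used[CAM_SLOTS];
	uint64_t key[CAM_SLOTS]; /* assumed match key, e.g. VMID + page */
};

/*
 * Returns the slot index if the fault is delivered to the IRQ handler,
 * or -1 if it is filtered (a matching entry exists) or the CAM is full.
 */
static int toy_cam_fault(struct toy_cam *cam, uint64_t key)
{
	int free_slot = -1;

	for (int i = 0; i < CAM_SLOTS; i++) {
		if (cam->used[i] && cam->key[i] == key)
			return -1; /* duplicate: IRQ is filtered out */
		if (!cam->used[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1; /* no free slot left */

	cam->used[free_slot] = true;
	cam->key[free_slot] = key;
	return free_slot;
}

/* Models a working ACK: the slot is freed and may match again. */
static void toy_cam_ack(struct toy_cam *cam, int idx)
{
	assert(idx >= 0 && idx < CAM_SLOTS);
	cam->used[idx] = false;
}
```

In this model, a second toy_cam_fault() with the same key returns -1 until 
toy_cam_ack() is called for that slot, which corresponds to the "handler is 
never called again" hang. It does not reproduce the Navi 48 symptom, where 
the handler keeps running without resolving the fault.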

> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

Interesting. Could you share what bits you checked on src_data[2]?

The gfx9-era constants worked for me on both Navi 31 and 48 for detecting 
retry faults. However, I needed to program some extra register fields in the 
gfxhub code to actually enable retry fault interrupts.
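
For reference, the gfx9-style check mentioned above can be sketched as 
follows. This is a standalone illustration, not the driver code: the 0x80 
value is what I believe upstream amdgpu_gmc.h defines and should be 
double-checked, and struct toy_iv_entry and toy_is_retry_fault() are 
hypothetical stand-ins for the real IV entry and handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match upstream amdgpu_gmc.h; verify before relying on it. */
#define AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY 0x80

/* Cut-down stand-in for the IV entry: only the field used here. */
struct toy_iv_entry {
	uint32_t src_data[4];
};

/* gfx9-style detection: the retry flag is tested on src_data[1]. */
static bool toy_is_retry_fault(const struct toy_iv_entry *entry)
{
	return !!(entry->src_data[1] & AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY);
}
```

A handler that looks at src_data[2] instead would need its own mask. Which 
bits that would be is not stated in this thread, so it is left out here.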

> 
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2 already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy
> on the user VMID with SDMA parked on a GCR ack.

I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?

> 4. Up to ~512 MiB our patches resolve faults cleanly;

That's pretty impressive! Nice work!

> at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the
> BO-clear runs in ih_soft_work context.

Actually something I forgot to ask: on Navi 4x is it possible to use the IH1 
ring? On my machine it seemed that the retry fault interrupts always come in 
on the IH0 ring even though the IH1 is enabled and configured upstream already.

> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.
 
You can view it on patchwork or the mailing list archives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500

Or, if that's more convenient for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults

Thanks & best regards,
Timur
