Hi Timur, Alex,

Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for the
past few weeks, and what you're describing on NV48 lines up closely with what
we've seen.

Quick highlights from my work:

1. IH retry CAM ACK doesn't actually free the slot when written via WDOORBELL
on NV4; we have to use MMIO
(WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
You may want to check that, since "fault never resolves" is exactly the
symptom you'd see if the CAM never gets cleared.

2. gfx12 needs its own retry-fault detection path:
amdgpu_gmc_handle_retry_fault keyed on gfx9-era constants
(AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on gfx12.
We added a gfx12-native handler that reads from src_data[2] for NV4.

3. TLB flush making it worse is a known trap; on NV4 we see the same. The
flush adds more pressure on the same UTC L2 that is already saturated by the
retry storm, so the GCR can't drain. We have UMR captures showing GCVM_L2
stuck busy on the user VMID with SDMA parked on a GCR ack.

4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random
hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the
BO-clear runs in ih_soft_work context.
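
To illustrate items 1 and 2, here is a minimal standalone sketch in plain
userspace C. It is not the actual patch. Everything other than the identifiers
quoted above is an assumption for illustration: struct iv_entry_sketch, the
helper names, and especially GFX12_RETRY_BIT_SKETCH, which is a placeholder
because the real gfx12 bit position in src_data[2] is not given here. The 0x80
value for the gfx9 retry flag is assumed from amdgpu_gmc.h and should be
verified against the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* gfx9-era retry flag; value assumed from amdgpu_gmc.h, verify locally. */
#define AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY 0x80u

/* HYPOTHETICAL placeholder: the real gfx12 retry bit in src_data[2]
 * is not stated in this thread. */
#define GFX12_RETRY_BIT_SKETCH 0x1u

/* Simplified stand-in for the IV entry; only the field used here. */
struct iv_entry_sketch {
	uint32_t src_data[4];
};

/* gfx9-style check: retry flag lives in src_data[1]. */
static bool is_retry_fault_gfx9(const struct iv_entry_sketch *e)
{
	return (e->src_data[1] & AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY) != 0;
}

/* gfx12-style check (sketch): retry flag read from src_data[2]. */
static bool is_retry_fault_gfx12_sketch(const struct iv_entry_sketch *e)
{
	return (e->src_data[2] & GFX12_RETRY_BIT_SKETCH) != 0;
}

/* Value for the IH_RETRY_CAM_ACK write: low 10 bits of the CAM index,
 * matching the "cam_index & 0x3ff" mask in item 1. */
static uint32_t retry_cam_ack_value(uint32_t cam_index)
{
	return cam_index & 0x3ffu;
}

/* Small self-check showing why the gfx9 check misses a gfx12-style entry. */
static void retry_sketch_selfcheck(void)
{
	struct iv_entry_sketch e = { .src_data = { 0, 0, GFX12_RETRY_BIT_SKETCH, 0 } };

	assert(!is_retry_fault_gfx9(&e));        /* flag is not in src_data[1] */
	assert(is_retry_fault_gfx12_sketch(&e)); /* flag found in src_data[2] */
	assert(retry_cam_ack_value(0x7ffu) == 0x3ffu);
}
```

With the flag only present in src_data[2], the gfx9-style check returns false,
which is the "never matches on gfx12" behaviour from item 2. In the driver the
masked value would go out through the MMIO write from item 1 rather than being
returned from a helper.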

Could you reply with your series? I tried searching the inbox but couldn't find 
it. Once I have it, I can diff against ours to see what overlaps and what's 
net-new on each side.

AMIR SHETAIA
Senior Software Development Engineer  |  AMD
Software Platform Architecture Team
----------------------------------------------------------------------------------------------------------------------------------
1 Commerce Valley Drive, Markham, ON L3T 7X6

-----Original Message-----
From: Timur Kristóf <[email protected]>
Sent: Wednesday, May 13, 2026 12:43 PM
To: Shetaia, Amir <[email protected]>; Alex Deucher <[email protected]>
Cc: [email protected]; Deucher, Alexander 
<[email protected]>; Koenig, Christian <[email protected]>; 
Marek Olšák <[email protected]>; Natalie Vock <[email protected]>; Melissa Wen 
<[email protected]>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

On Wednesday, May 13, 2026 6:36:02 PM Central European Summer Time Alex Deucher 
wrote:
> + Amir
>
> Amir may have some insights on navi4x as he was looking at this recently.
>
> Alex

Hi Alex, Amir,

I think we are very close to enabling retry faults by default on Navi 3.
I'd be happy to receive feedback on the above series.

With regard to Navi 4:

I also attempted to get it working on Navi 48, and I managed to get retry 
faults enabled, but it seems that amdgpu_vm_handle_fault() can't actually 
resolve the page fault on Navi 48. It just keeps retrying until it times out.
Christian suggested this may be due to an invalid page being stuck in the 
cache. I tried adding a TLB flush but unfortunately that just made it worse (it 
hangs irrecoverably).

Any insight is appreciated!

Thanks & best regards,
Timur

>
> On Wed, May 13, 2026 at 12:30 PM Timur Kristóf
> <[email protected]> wrote:
> > Fix several issues in retry fault handling, such as enabling the
> > retry fault interrupt (necessary for retry faults to work).
> >
> > Improve retry faults on Navi 3 dGPUs by enabling the filter CAM,
> > which can filter the repeated page fault interrupts that happen when
> > retry faults are enabled, making the handling more efficient.
> >
> > With this series, the kernel is able to mitigate most page faults on
> > Navi 3 without causing a hang and without a need to reset the GPU,
> > when the amdgpu.noretry=0 module parameter is set.
> >
> > Timur Kristóf (6):
> >   drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
> >   drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
> >   drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
> >   drm/amdgpu/gmc: Don't compare page fault timestamps with other
> >     interrupts
> >   drm/amdgpu/ih: Add retry_cam_ack IH function pointer
> >   drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 19 +++++++++++--------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
> >  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
> >  drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
> >  22 files changed, 134 insertions(+), 71 deletions(-)
> >
> > --
> > 2.54.0



