AMD General

> -----Original Message-----
> From: Shetaia, Amir <[email protected]>
> Sent: Wednesday, May 13, 2026 1:29 PM
> To: Timur Kristóf <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian
> <[email protected]>; Marek Olšák <[email protected]>; Natalie
> Vock <[email protected]>; Melissa Wen <[email protected]>
> Subject: RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> AMD General
>
> Hi Timur, Alex,
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen
>
> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
> I think you may want to check that, since "fault never resolves" is
> exactly the symptom you'd see if the CAM never gets cleared.
>
> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never
> matches on gfx12. We added a gfx12-native handler that reads from
> src_data[2] for NV4.
>
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2 already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck
> busy on the user VMID with SDMA parked on a GCR ack.
>
> 4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the
> BO-clear runs in ih_soft_work context.
>
> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.
>
Here's the patch series:
https://patchwork.freedesktop.org/series/166522/

Alex

> AMIR SHETAIA
> Senior Software Development Engineer | AMD Software Platform
> Architecture Team
> ----------------------------------------------------------------------------------------------
> ------------------------------------
> 1 Commerce Valley Drive, Markham, ON L3T 7X6 LinkedIn | Instagram | X |
> amd.com
>
>
>
>
> -----Original Message-----
> From: Timur Kristóf <[email protected]>
> Sent: Wednesday, May 13, 2026 12:43 PM
> To: Shetaia, Amir <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian
> <[email protected]>; Marek Olšák <[email protected]>; Natalie
> Vock <[email protected]>; Melissa Wen <[email protected]>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> [You don't often get email from [email protected]. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Wednesday, May 13, 2026 6:36:02 PM Central European Summer Time
> Alex Deucher wrote:
> > + Amir
> >
> > Amir may have some insights on navi4x as he was looking at this recently.
> >
> > Alex
>
> Hi Alex, Amir,
>
> I think we are very close to enabling retry faults by default on Navi 3.
> I'd be happy to receive feedback on the above series.
>
> With regards to Navi 4:
>
> I also attempted to get it working on Navi 48, and I managed to get retry
> faults enabled, but it seems that amdgpu_vm_handle_fault() can't actually
> resolve the page fault on Navi 48. It just keeps retrying until it times out.
> Christian suggested this may be due to an invalid page being stuck in the
> cache. I tried adding a TLB flush but unfortunately that just made it worse
> (it hangs irrecoverably).
>
> Any insight is appreciated!
>
> Thanks & best regards,
> Timur
>
> >
> > On Wed, May 13, 2026 at 12:30 PM Timur Kristóf <[email protected]> wrote:
> > > Fix some issues regarding retry fault handling, such as enabling the
> > > retry fault interrupt (necessary for retry faults to work) and such.
> > >
> > > Improve retry faults on Navi 3 dGPUs by enabling the filter CAM,
> > > which can filter the repeated page fault interrupts that happen when
> > > retry faults are enabled, making the handling more efficient.
> > >
> > > With this series, the kernel is able to mitigate most page faults on
> > > Navi 3 without causing a hang and without a need to reset the GPU,
> > > when the amdgpu.noretry=0 module parameter is set.
> > >
> > > Timur Kristóf (6):
> > >   drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
> > >   drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
> > >   drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
> > >   drm/amdgpu/gmc: Don't compare page fault timestamps with other
> > >     interrupts
> > >   drm/amdgpu/ih: Add retry_cam_ack IH function pointer
> > >   drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
> > >
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 19 +++++++++++--------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
> > >  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
> > >  22 files changed, 134 insertions(+), 71 deletions(-)
> > >
> > > --
> > > 2.54.0
> > >
>
