[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Wednesday, December 10, 2025 5:53 PM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Wait for eviction fence before scheduling
> resume work
>
> On 12/10/25 04:04, Zhang, Jesse(Jie) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: Koenig, Christian <[email protected]>
> >> Sent: Tuesday, December 9, 2025 5:42 PM
> >> To: Zhang, Jesse(Jie) <[email protected]>;
> >> [email protected]
> >> Cc: Deucher, Alexander <[email protected]>
> >> Subject: Re: [PATCH] drm/amdgpu: Wait for eviction fence before
> >> scheduling resume work
> >>
> >> On 12/9/25 10:23, Jesse.Zhang wrote:
> >>> In the amdgpu_userq_evict function, after signaling the eviction
> >>> fence, we need to ensure it's processed before scheduling the resume
> >>> work. This prevents potential race conditions where the resume work
> >>> might start before the eviction fence has been fully handled,
> >>> leading to inconsistent state in user queues.
> >>
> >> Well signaling the fence means it is fully processed. So this change
> >> here is just bluntly nonsense.
> >>
> >> What exactly is happening?
> > [Zhang, Jesse(Jie)] Hi Christian,
> >
> > Let me clarify the issue we're observing with the SDMA user queues under
> > stress.
> >
> > **The Problem:**
> > During stress testing of SDMA user queues, we intermittently see stale
> > doorbell values persisting after the CPU writes to `cpu_wptr`.
> > Specifically, after updating `cpu_wptr` (which should update the
> > doorbell), the doorbell register sometimes retains its previous value,
> > causing inconsistent queue behavior. This happens randomly under heavy
> > load but is reproducible in stress scenarios.
> >
> >
> > **Root Cause Analysis:**
> > After signaling the eviction fence, the resume work is scheduled
> > immediately without ensuring that all internal driver state updates (queue
> > state transitions, MES state cleanup, etc.) are fully visible and
> > consistent.
>
> That is a massive bug and the root cause of this issue.
>
> The eviction fence can only be signaled *after* all queue state transitions
> and the MES state is clean.
>
> What the heck is going on here? What state are we talking about?

[Zhang, Jesse(Jie)] This issue was discovered during SDMA user queue testing in IGT.
After updating wptr_cpu and the doorbell from the CPU side, the doorbell register sometimes randomly retains its previous value.
We suspect this is related to memory synchronization. Do you have any guidance on this?
Here is the internal ticket: https://ontrack-internal.amd.com/browse/SWDEV-565880

Thanks
Jesse

> > How about changing it this way?
>
> Stuff like that is an absolutely clear NAK as well.
>
> Regards,
> Christian.
>
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > @@ -1130,8 +1130,27 @@ static void amdgpu_userq_restore_worker(struct work_struct *work) {
> >  	/* Schedule a resume work */
> > -	schedule_delayed_work(&uq_mgr->resume_work, 0);
> > +	schedule_delayed_work(&uq_mgr->resume_work,
> > +			      usecs_to_jiffies(1000));
> >
> > Thanks
> > Jesse
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Signed-off-by: Jesse Zhang <[email protected]>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 4 ++++
> >>>  1 file changed, 4 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> >>> index 2f97f35e0af5..ed744b2edc61 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> >>> @@ -1238,6 +1238,10 @@ amdgpu_userq_evict(struct amdgpu_userq_mgr *uq_mgr,
> >>>  		return;
> >>>  	}
> >>>
> >>> +	/* Wait for eviction fence to be processed before schedule a resume work */
> >>> +	if (dma_fence_wait_timeout(&ev_fence->base, false, msecs_to_jiffies(100)) <= 0) {
> >>> +		dev_warn(adev->dev, "Eviction fence wait timed out\n");
> >>> +	}
> >>>  	/* Schedule a resume work */
> >>>  	schedule_delayed_work(&uq_mgr->resume_work, 0);
> >>>  }
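
Christian's objection, that a signaled fence is by definition fully processed, can be illustrated with a small user-space model. The sketch below is not kernel code: struct toy_fence, toy_fence_signal(), toy_fence_wait_timeout() and toy_signal_then_wait() are hypothetical names invented for illustration, and the polling loop only loosely mimics the "remaining timeout on success, 0 on timeout" return convention of dma_fence_wait_timeout(). Under those assumptions, it shows that a wait placed after the signal returns at once and orders nothing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a fence; this is NOT the kernel's struct dma_fence. */
struct toy_fence {
	bool signaled;
};

/* Mark the fence as signaled. In this model that is all "processing" means. */
static void toy_fence_signal(struct toy_fence *f)
{
	assert(f != NULL);
	f->signaled = true;
}

/*
 * Poll until the fence is signaled or the budget runs out.
 * Returns 0 on timeout, otherwise the remaining budget (at least 1).
 * *polls reports how many iterations were actually spent waiting.
 */
static long toy_fence_wait_timeout(const struct toy_fence *f, long timeout,
				   long *polls)
{
	long remaining = timeout;

	assert(f != NULL && polls != NULL);
	*polls = 0;
	while (!f->signaled) {
		if (remaining <= 0)
			return 0;
		remaining--;
		(*polls)++;
	}
	return remaining > 0 ? remaining : 1;
}

/* Same order as the proposed patch: signal first, then wait. */
static long toy_signal_then_wait(struct toy_fence *f, long timeout, long *polls)
{
	toy_fence_signal(f);
	return toy_fence_wait_timeout(f, timeout, polls);
}
```

Calling toy_signal_then_wait() on a fresh fence with a budget of 100 returns 100 with zero polls, whereas waiting on an unsignaled fence spends the whole budget and returns 0. In this model at least, the wait the patch adds after signaling cannot delay the resume work, which fits the reply that the real problem has to be elsewhere.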
