> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Tuesday, December 9, 2025 5:42 PM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Wait for eviction fence before scheduling resume work
>
> On 12/9/25 10:23, Jesse.Zhang wrote:
> > In the amdgpu_userq_evict function, after signaling the eviction
> > fence, we need to ensure it's processed before scheduling the resume
> > work. This prevents potential race conditions where the resume work
> > might start before the eviction fence has been fully handled, leading
> > to inconsistent state in user queues.
>
> Well, signaling the fence means it is fully processed. So this change here is
> just bluntly nonsense.
>
> What exactly is happening?
[Zhang, Jesse(Jie)] Hi Christian,
Let me clarify the issue we're observing with the SDMA user queues under stress.
**The Problem:**
During stress testing of SDMA user queues, we intermittently see stale doorbell
values persisting after the CPU writes to `cpu_wptr`.
Specifically, after updating `cpu_wptr` (which should update the doorbell), the
doorbell register sometimes retains its previous value, causing inconsistent
queue behavior. This happens randomly under heavy load but is reproducible in
stress scenarios.
**Root Cause Analysis:**
After the eviction fence is signaled, the resume work is scheduled immediately,
without ensuring that all internal driver state updates (queue state
transitions, MES state cleanup, etc.) are fully visible and consistent.
How about changing it this way?
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -1130,8 +1130,27 @@ static void amdgpu_userq_restore_worker(struct work_struct *work)
 {
 	/* Schedule a resume work */
-	schedule_delayed_work(&uq_mgr->resume_work, 0);
+	schedule_delayed_work(&uq_mgr->resume_work, usecs_to_jiffies(1000));
Thanks
Jesse
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Jesse Zhang <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > index 2f97f35e0af5..ed744b2edc61 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > @@ -1238,6 +1238,10 @@ amdgpu_userq_evict(struct amdgpu_userq_mgr *uq_mgr,
> >  		return;
> >  	}
> >
> > +	/* Wait for eviction fence to be processed before schedule a resume work */
> > +	if (dma_fence_wait_timeout(&ev_fence->base, false, msecs_to_jiffies(100)) <= 0) {
> > +		dev_warn(adev->dev, "Eviction fence wait timed out\n");
> > +	}
> >  	/* Schedule a resume work */
> >  	schedule_delayed_work(&uq_mgr->resume_work, 0);
> >  }