On 12/10/25 04:04, Zhang, Jesse(Jie) wrote:
>
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Tuesday, December 9, 2025 5:42 PM
>> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
>> Cc: Deucher, Alexander <[email protected]>
>> Subject: Re: [PATCH] drm/amdgpu: Wait for eviction fence before scheduling
>> resume work
>>
>> On 12/9/25 10:23, Jesse.Zhang wrote:
>>> In the amdgpu_userq_evict function, after signaling the eviction
>>> fence, we need to ensure it's processed before scheduling the resume
>>> work. This prevents potential race conditions where the resume work
>>> might start before the eviction fence has been fully handled, leading
>>> to inconsistent state in user queues.
>>
>> Well, signaling the fence means it is fully processed. So this change here
>> is just plain nonsense.
>>
>> What exactly is happening?
> [Zhang, Jesse(Jie)] Hi Christian,
>
> Let me clarify the issue we're observing with the SDMA user queues under
> stress.
>
> **The Problem:**
> During stress testing of SDMA user queues, we intermittently see stale
> doorbell values persisting after the CPU writes to `cpu_wptr`.
> Specifically, after updating `cpu_wptr` (which should update the doorbell),
> the doorbell register sometimes retains its previous value,
> causing inconsistent queue behavior. This happens randomly under heavy load
> but is reproducible in stress scenarios.
>
>
> **Root Cause Analysis:**
> After signaling the eviction fence, the resume work is scheduled immediately
> without ensuring that all internal driver state updates
> (queue state transitions, MES state cleanup, etc.) are fully visible and
> consistent.
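>
> To make the suspected ordering concrete, the evict path currently looks
> roughly like this (simplified sketch only, not the exact code; the suspend
> helper name is a placeholder):
>
>	static void userq_evict_sketch(struct amdgpu_userq_mgr *uq_mgr,
>				       struct amdgpu_eviction_fence *ev_fence)
>	{
>		/* 1. Suspend the user queues through MES */
>		amdgpu_userq_suspend_all(uq_mgr);	/* placeholder name */
>
>		/* 2. Signal the eviction fence */
>		dma_fence_signal(&ev_fence->base);
>
>		/* 3. Immediately queue the resume work */
>		schedule_delayed_work(&uq_mgr->resume_work, 0);
>	}
>
> Our concern is that the resume work can start running before the bookkeeping
> around step 1 (queue state transitions, MES cleanup) has fully settled.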
That is a massive bug and the root cause of this issue.
The eviction fence can only be signaled *after* all queue state transitions
are complete and the MES state is clean.
What the heck is going on here? What state are we talking about?
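
Just to spell out the requirement, the ordering has to look roughly like this
(sketch only, the suspend helper name is a placeholder):

	/* Suspend/unmap all user queues through MES and wait for completion,
	 * so every queue state transition and the MES bookkeeping is done.
	 */
	userq_suspend_all_and_wait(uq_mgr);	/* placeholder */

	/* Only now, with the state clean, signal the eviction fence. */
	dma_fence_signal(&ev_fence->base);

	/* Work scheduled afterwards can rely on the state being consistent,
	 * so no extra fence wait or artificial delay is needed here.
	 */
	schedule_delayed_work(&uq_mgr->resume_work, 0);
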
> How about changing it this way?
Stuff like that is an absolutely clear NAK as well.
Regards,
Christian.
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -1130,8 +1130,27 @@ static void amdgpu_userq_restore_worker(struct work_struct *work)
> {
> /* Schedule a resume work */
> - schedule_delayed_work(&uq_mgr->resume_work, 0);
> + schedule_delayed_work(&uq_mgr->resume_work, usecs_to_jiffies(1000));
>
> Thanks
> Jesse
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Jesse Zhang <[email protected]>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index 2f97f35e0af5..ed744b2edc61 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -1238,6 +1238,10 @@ amdgpu_userq_evict(struct amdgpu_userq_mgr *uq_mgr,
>>>  		return;
>>>  	}
>>>
>>> +	/* Wait for eviction fence to be processed before scheduling a resume work */
>>> +	if (dma_fence_wait_timeout(&ev_fence->base, false, msecs_to_jiffies(100)) <= 0) {
>>> +		dev_warn(adev->dev, "Eviction fence wait timed out\n");
>>> +	}
>>>  	/* Schedule a resume work */
>>>  	schedule_delayed_work(&uq_mgr->resume_work, 0);
>>> }
>