On 12/10/25 04:04, Zhang, Jesse(Jie) wrote:
>
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Tuesday, December 9, 2025 5:42 PM
>> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
>> Cc: Deucher, Alexander <[email protected]>
>> Subject: Re: [PATCH] drm/amdgpu: Wait for eviction fence before scheduling
>> resume work
>>
>> On 12/9/25 10:23, Jesse.Zhang wrote:
>>> In the amdgpu_userq_evict function, after signaling the eviction
>>> fence, we need to ensure it's processed before scheduling the resume
>>> work. This prevents potential race conditions where the resume work
>>> might start before the eviction fence has been fully handled, leading
>>> to inconsistent state in user queues.
>>
>> Well, signaling the fence means it is fully processed. So this change here
>> is just plain nonsense.
>>
>> What exactly is happening?
> [Zhang, Jesse(Jie)] Hi Christian,
>
> Let me clarify the issue we're observing with the SDMA user queues under
> stress.
>
> **The Problem:**
> During stress testing of SDMA user queues, we intermittently see stale
> doorbell values persisting after the CPU writes to `cpu_wptr`.
> Specifically, after updating `cpu_wptr` (which should update the doorbell),
> the doorbell register sometimes retains its previous value,
> causing inconsistent queue behavior. This happens randomly under heavy load
> but is reproducible in stress scenarios.
>
>
> **Root Cause Analysis:**
> After signaling the eviction fence, the resume work is scheduled immediately
> without ensuring that all internal driver state updates
> (queue state transitions, MES state cleanup, etc.) are fully visible and
> consistent.
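>
> To make the suspected ordering concrete, the evict path currently looks
> roughly like this (simplified sketch only, not the exact code; the suspend
> helper name is a placeholder):
>
>	static void userq_evict_sketch(struct amdgpu_userq_mgr *uq_mgr,
>				       struct amdgpu_eviction_fence *ev_fence)
>	{
>		/* 1. Suspend the user queues through MES */
>		amdgpu_userq_suspend_all(uq_mgr);	/* placeholder name */
>
>		/* 2. Signal the eviction fence */
>		dma_fence_signal(&ev_fence->base);
>
>		/* 3. Immediately queue the resume work */
>		schedule_delayed_work(&uq_mgr->resume_work, 0);
>	}
>
> Our concern is that the resume work can start running before the bookkeeping
> around step 1 (queue state transitions, MES cleanup) has fully settled.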
That is a massive bug and the root cause of this issue.
The eviction fence can only be signaled *after* all queue state transitions
are complete and the MES state is clean.
What the heck is going on here? What state are we talking about?
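
Just to spell out the requirement, the ordering has to look roughly like this
(sketch only, the suspend helper name is a placeholder):

	/* Suspend/unmap all user queues through MES and wait for completion,
	 * so every queue state transition and the MES bookkeeping is done.
	 */
	userq_suspend_all_and_wait(uq_mgr);	/* placeholder */

	/* Only now, with the state clean, signal the eviction fence. */
	dma_fence_signal(&ev_fence->base);

	/* Work scheduled afterwards can rely on the state being consistent,
	 * so no extra fence wait or artificial delay is needed here.
	 */
	schedule_delayed_work(&uq_mgr->resume_work, 0);
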
> How about changing it this way?
Stuff like that is an absolutely clear NAK as well.
Regards,
Christian.
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -1130,8 +1130,27 @@ static void amdgpu_userq_restore_worker(struct work_struct *work)
> {
> /* Schedule a resume work */
> - schedule_delayed_work(&uq_mgr->resume_work, 0);
> + schedule_delayed_work(&uq_mgr->resume_work, usecs_to_jiffies(1000));
>
> Thanks
> Jesse
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Jesse Zhang <[email protected]>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index 2f97f35e0af5..ed744b2edc61 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -1238,6 +1238,10 @@ amdgpu_userq_evict(struct amdgpu_userq_mgr *uq_mgr,
>>>  		return;
>>>  	}
>>>
>>> +	/* Wait for eviction fence to be processed before scheduling a resume work */
>>> +	if (dma_fence_wait_timeout(&ev_fence->base, false, msecs_to_jiffies(100)) <= 0) {
>>> +		dev_warn(adev->dev, "Eviction fence wait timed out\n");
>>> +	}
>>>  	/* Schedule a resume work */
>>>  	schedule_delayed_work(&uq_mgr->resume_work, 0);
>>> }
>