On Tue, Jan 13, 2026 at 10:57 AM Christian König
<[email protected]> wrote:
>
> On 1/13/26 14:34, Philipp Stanner wrote:
> > On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote:
> >> On 1/8/26 15:48, Alex Deucher wrote:
> >>> We only want to stop the work queues, not mess with the
> >>> pending list so just stop the work queues.
> >
> > Ideally amdgpu could stop touching the pending_list altogether forever,
> > as discussed at XDC. Is work for that in the pipe? Is that what this
> > patch is for?
>
> Yes.
>
> >
> >>
> >> Oh, yes please! I can't remember how long we have worked towards that.
> >>
> >> But we also need to change the return code so that the scheduler now 
> >> re-inserts the job into the pending list.
> >
> > You're referring to false-positive timeouts. Porting users to that
> > typically consists of adding that return code and also removing
> > whatever the driver used to do to inject the non-timedout job into the
> > scheduler again.
> >
> > How is that being done here?
>
> Previously drm_sched_stop() would insert the job back into the pending list 
> after stopping the scheduler thread.
>
> But when that is replaced with drm_sched_wqueue_stop() then that won't happen 
> any more. That is a good thing and prevents us from running into problems 
> like UAF because the HW fence signaled.
>
> As far as I can see we should start returning DRM_GPU_SCHED_STAT_NO_HANG from 
> amdgpu even when there was actually a hang (maybe rename the return code).
>

We already return DRM_GPU_SCHED_STAT_NOMINAL unconditionally.  The
only other option is DRM_GPU_SCHED_STAT_ENODEV which is not correct.

As far as I can see, there is nothing else to do.  The fence will be
signalled after the adapter reset.

Alex

> Regards,
> Christian.
>
> >
> > P.
> >
> >>
> >> Adding Philip on CC to double check what I say above.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Signed-off-by: Alex Deucher <[email protected]>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 80572f71ff627..868ab5314c0d1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct 
> >>> amdgpu_device *adev,
> >>>                     if (!amdgpu_ring_sched_ready(ring))
> >>>                             continue;
> >>>
> >>> -                   drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> >>> +                   drm_sched_wqueue_stop(&ring->sched);
> >>>
> >>>                     if (need_emergency_restart)
> >>>                             
> >>> amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
> >>> @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct 
> >>> list_head *device_list,
> >>>                     if (!amdgpu_ring_sched_ready(ring))
> >>>                             continue;
> >>>
> >>> -                   drm_sched_start(&ring->sched, 0);
> >>> +                   drm_sched_wqueue_start(&ring->sched);
> >>>             }
> >>>
> >>>             if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && 
> >>> !job_signaled)
> >>
> >
>

Reply via email to