On Tue, Jan 13, 2026 at 10:57 AM Christian König <[email protected]> wrote: > > On 1/13/26 14:34, Philipp Stanner wrote: > > On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote: > >> On 1/8/26 15:48, Alex Deucher wrote: > >>> We only want to stop the work queues, not mess with the > >>> pending list so just stop the work queues. > > > > Ideally amdgpu could stop touching the pending_list altogether forever, > > as discussed at XDC. Is work for that in the pipe? Is that what this > > patch is for? > > Yes. > > > > >> > >> Oh, yes please! I can't remember how long we have worked towards that. > >> > >> But we also need to change the return code so that the scheduler now > >> re-inserts the job into the pending list. > > > > You're referring to false-positive timeouts. Porting users to that > > typically consists of adding that return code and also removing > > whatever the driver used to do to inject the non-timedout job into the > > scheduler again. > > > > How is that being done here? > > Previously drm_sched_stop() would insert the job back into the pending list > after stopping the scheduler thread. > > But when that is replaced with drm_sched_wqueue_stop() then that won't happen > any more. That is a good thing and prevents us from running into problems > like UAF because the HW fence signaled. > > As far as I can see we should start returning DRM_GPU_SCHED_STAT_NO_HANG from > amdgpu even when there was actually a hang (maybe rename the return code). >
We already return DRM_GPU_SCHED_STAT_NOMINAL unconditionally. The only other option is DRM_GPU_SCHED_STAT_ENODEV which is not correct. As far as I can see, there is nothing else to do. The fence will be signalled after the adapter reset. Alex > Regards, > Christian. > > > > > P. > > > >> > >> Adding Philip on CC to double check what I say above. > >> > >> Regards, > >> Christian. > >> > >>> > >>> Signed-off-by: Alex Deucher <[email protected]> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> index 80572f71ff627..868ab5314c0d1 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct > >>> amdgpu_device *adev, > >>> if (!amdgpu_ring_sched_ready(ring)) > >>> continue; > >>> > >>> - drm_sched_stop(&ring->sched, job ? &job->base : NULL); > >>> + drm_sched_wqueue_stop(&ring->sched); > >>> > >>> if (need_emergency_restart) > >>> > >>> amdgpu_job_stop_all_jobs_on_sched(&ring->sched); > >>> @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct > >>> list_head *device_list, > >>> if (!amdgpu_ring_sched_ready(ring)) > >>> continue; > >>> > >>> - drm_sched_start(&ring->sched, 0); > >>> + drm_sched_wqueue_start(&ring->sched); > >>> } > >>> > >>> if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && > >>> !job_signaled) > >> > > >
