On 05/02/2026 at 15:26, Alex Deucher wrote:
On Thu, Feb 5, 2026 at 9:22 AM Pierre-Eric Pelloux-Prayer
<[email protected]> wrote:



On 30/01/2026 at 18:30, Alex Deucher wrote:
We only want to stop the work queues, not mess with the
fences, etc.

v2: add the job back to the pending list.
v3: return the proper job status so scheduler adds the
      job back to the pending list

Signed-off-by: Alex Deucher <[email protected]>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 6 ++----
   2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e69ab8a923e31..a5b43d57c7b05 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6313,7 +6313,7 @@ static void amdgpu_device_halt_activities(struct amdgpu_device *adev,
                       if (!amdgpu_ring_sched_ready(ring))
                               continue;

-                     drm_sched_stop(&ring->sched, job ? &job->base : NULL);
+                     drm_sched_wqueue_stop(&ring->sched);

                       if (need_emergency_restart)
                               amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
@@ -6397,7 +6397,7 @@ static int amdgpu_device_sched_resume(struct list_head *device_list,
                       if (!amdgpu_ring_sched_ready(ring))
                               continue;

-                     drm_sched_start(&ring->sched, 0);
+                     drm_sched_wqueue_start(&ring->sched);
               }

               if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index df06a271bdf99..cd0707737a29b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -92,7 +92,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
       struct drm_wedge_task_info *info = NULL;
       struct amdgpu_task_info *ti = NULL;
       struct amdgpu_device *adev = ring->adev;
-     enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
       int idx, r;

       if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
@@ -147,8 +146,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                               ring->sched.name);
                       drm_dev_wedged_event(adev_to_drm(adev),
                                            DRM_WEDGE_RECOVERY_NONE, info);
-                     /* This is needed to add the job back to the pending list */
-                     status = DRM_GPU_SCHED_STAT_NO_HANG;
                       goto exit;
               }
               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
@@ -184,7 +181,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
   exit:
       amdgpu_vm_put_task_info(ti);
       drm_dev_exit(idx);
-     return status;
+     /* This is needed to add the job back to the pending list */
+     return DRM_GPU_SCHED_STAT_NO_HANG;

This part seems unrelated to the patch and is overwriting what was done
in patch 1/12.

Patch 1 fixes the pending list handling for per-queue resets.  This
patch reworks the adapter reset path to match the behavior of the
per-queue reset path.  After this patch they match, so we can safely
return DRM_GPU_SCHED_STAT_NO_HANG in both cases.  Previously the
adapter reset path called drm_sched_stop()/drm_sched_start(), which
handled re-adding the job to the pending list.  Since it no longer
does, we need to return DRM_GPU_SCHED_STAT_NO_HANG for both cases.

I looked a bit more into the patchset that added DRM_GPU_SCHED_STAT_NO_HANG, and your changes make sense. This patch is:

Acked-by: Pierre-Eric Pelloux-Prayer <[email protected]>
