Hi Philipp,

On 12/05/25 08:04, Philipp Stanner wrote:
On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote:
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:

   1. The GPU exposes some mechanism that ensures the GPU is still making
      progress. By checking this mechanism, we can safely skip the reset,
      rearm the timeout, and allow the job to continue running until
      completion. This is the case for v3d and Etnaviv.

Who is "we" and where is the reset skipped? In the timedout_job()
callback?

"we" is just a generic way to say "we, the driver developers". The reset
is skipped in the `timedout_job()` callback.
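
To make that concrete, the driver-side callback could look roughly like
this. It's only a sketch, not code from v3d or Etnaviv; `struct my_job`,
`my_read_progress_counter()` and `my_gpu_reset()` are made-up placeholders
for driver-specific code:

#include <drm/gpu_scheduler.h>
#include <linux/container_of.h>
#include <linux/types.h>

/* Hypothetical driver job wrapper, only for illustration. */
struct my_job {
	struct drm_sched_job base;
	u32 last_progress;
};

/* Hypothetical driver helpers standing in for device-specific code. */
static u32 my_read_progress_counter(struct my_job *job);
static void my_gpu_reset(struct drm_gpu_scheduler *sched);

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);
	u32 progress = my_read_progress_counter(job);

	/*
	 * Case 1 from the commit message: the job is slow but still making
	 * progress, so skip the reset. With this patch, returning the new
	 * status makes the scheduler put the job back on the pending list
	 * and rearm the timeout.
	 */
	if (progress != job->last_progress) {
		job->last_progress = progress;
		return DRM_GPU_SCHED_STAT_RUNNING;
	}

	/* Otherwise, fall back to the usual reset path. */
	my_gpu_reset(sched_job->sched);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}

The only part the scheduler sees is the return value; the progress check
itself stays entirely driver-specific.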


   2. TDR has fired before the IRQ that signals the fence.

Any concern about saying "Timeout" instead of "TDR"? I think most of us
aren't familiar with that acronym.

I'll use "Timeout" in v2.


Consequently, the job actually finishes, but it triggers a timeout before
      signaling the completion fence.

That formulation doesn't seem correct. Once the timeout fired, the job,
as far as the GPU is concerned, is already finished, isn't it?

Yes, but there is a race condition between the IRQ handler (which will
indeed signal the fence) and the timeout handler.


What is the "completion fence"? In the scheduler, we call the fence
returned by backend_ops.run_job() the "hardware fence".

And who is the "it" in "it triggers a timeout"? I assume you want to
say "the job has actually finished, but the scheduler triggers a
timeout anyways".


I'll address it in v2.


Also the purpose of that list is a bit unclear to me. It seems to be a
list of problems, but point 1 seems valid?


Point 2 is also valid, otherwise Xe, Panfrost, Lima, and Etnaviv
wouldn't have to handle this situation. The timeout handler can "win" the
race and, although the hardware fence will be signaled in the IRQ handler,
the job won't be freed.
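
With this series, that case could be detected along these lines. Again just
a sketch: real drivers typically check their own completion/IRQ state
rather than peeking at the parent fence, and `my_gpu_reset()` is a made-up
placeholder:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Hypothetical reset helper, standing in for device-specific code. */
static void my_gpu_reset(struct drm_gpu_scheduler *sched);

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
{
	/*
	 * s_fence->parent is the hardware fence returned by run_job(). If it
	 * has already signaled, the timeout lost the race against the IRQ
	 * handler: nothing is hung, so skip the reset and let the scheduler
	 * put the job back on the pending list so free_job() still runs.
	 */
	if (dma_fence_is_signaled(sched_job->s_fence->parent))
		return DRM_GPU_SCHED_STAT_RUNNING;

	my_gpu_reset(sched_job->sched);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}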



These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`.

Who is "we"? :)

Ditto.



This means that when the job finally signals completion (e.g. in the IRQ
handler),

A job doesn't signal completion.

The hardware / driver signals job completion by signaling the hardware
fence.

I'll address it in v2.


  the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.

OK, I think I get it. But isn't another explanation of the issue that
the driver callback doesn't take care of cleaning up the job that has
timed out (from the scheduler's perspective)?

It's not clear to me that the scheduler actually contains a bug here,
but rather is designed in a way that doesn't consider that some GPUs
have special timeout requirements or, rather, can have bursts of
slowness that don't actually indicate a timeout.

I think the commit message should be very clear about whether this is
an improvement of a design weakness or an actual bug fix.


To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
driver to skip the reset. This new status will indicate that the job
should be reinserted into the pending list, and the driver will still
signal its completion.

Hmm, yes, I think that this is the right way to address that problem.
+1



Signed-off-by: Maíra Canal <mca...@igalia.com>
---
  drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
  include/drm/gpu_scheduler.h            |  2 ++
  2 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+
+		/*
+		 * If the driver indicated that the GPU is still running and wants to skip
+		 * the reset, reinsert the job back into the pending list and rearm the
+		 * timeout.
+		 */
+		if (status == DRM_GPU_SCHED_STAT_RUNNING) {
+			spin_lock(&sched->job_list_lock);
+			list_add(&job->list, &sched->pending_list);
+			spin_unlock(&sched->job_list_lock);
+		}
 	} else {
 		spin_unlock(&sched->job_list_lock);
 	}
@@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
   * This function is typically used for reset recovery (see the docu of
   * drm_sched_backend_ops.timedout_job() for details). Do not call it for
   * scheduler teardown, i.e., before calling drm_sched_fini().
+ *
+ * As it's used for reset recovery, drm_sched_stop() shouldn't be called
+ * if the scheduler skipped the timeout (DRM_GPU_SCHED_STAT_RUNNING).

The same comment then applies to the counterpart, drm_sched_start().
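
Something along these lines in drm_sched_start()'s kerneldoc, presumably
(only a sketch of what v2 might add):

+ *
+ * As it's used for reset recovery, drm_sched_start() shouldn't be called
+ * if the scheduler skipped the timeout (DRM_GPU_SCHED_STAT_RUNNING).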


Thanks for your review! I'll address all your comments in v2.

Best Regards,
- Maíra

We might also want to look into who uses drm_sched_wqueue_{start,stop}
and consider if they need a comment. Though I don't expect you to do
that. Those functions are hacky legacy anyways.


P.

   */
  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
  {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -389,11 +389,13 @@ struct drm_sched_job {
   * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
   * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
   * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
+ * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
   */
  enum drm_gpu_sched_stat {
        DRM_GPU_SCHED_STAT_NONE,
        DRM_GPU_SCHED_STAT_NOMINAL,
        DRM_GPU_SCHED_STAT_ENODEV,
+       DRM_GPU_SCHED_STAT_RUNNING,
  };
 /**


