On Mon, 2025-06-16 at 10:27 +0100, Tvrtko Ursulin wrote:
> 
> On 12/06/2025 15:20, Philipp Stanner wrote:
> > On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 03/06/2025 10:31, Philipp Stanner wrote:
> > > > Since its inception, the GPU scheduler can leak memory if the driver
> > > > calls drm_sched_fini() while there are still jobs in flight.
> > > > 
> > > > The simplest way to solve this in a backwards compatible manner is by
> > > > adding a new callback, drm_sched_backend_ops.cancel_job(), which
> > > > instructs the driver to signal the hardware fence associated with the
> > > > job. Afterwards, the scheduler can safely use the established
> > > > free_job() callback for freeing the job.
> > > > 
> > > > Implement the new backend_ops callback cancel_job().
> > > > 
> > > > Suggested-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
> > > 
> > > Please just add the link to the patch here (it is only in the cover
> > > letter):
> > > 
> > > Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursu...@igalia.com/
> > 
> > That I can do, sure
> 
> Cool, with that, for this patch:
> 
> Acked-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
> 
> > > And you probably want to take the unit test modifications from the
> > > same patch too. You could put them in the same patch or separate.
> > 
> > Necessary adjustments for the unit tests are already implemented and
> > are waiting for review separately, since this can be done independently
> > from this entire series:
> > 
> > https://lore.kernel.org/dri-devel/20250605134154.191764-2-pha...@kernel.org/
> 
> For me it would make most sense to fold that into 2/6 from this series.
> I don't see it making sense as standalone. So if you could repost the
> series with it integrated, I will give it a spin and can review that
> patch at least.
It does make sense as an independent patch, because it is exactly that:
independent. It improves the unit tests so that they become a better role
model for the driver callbacks. All fences must always be signaled, which
is currently not the case there. Unit tests serve as a reference
implementation for new users, which is why I am stressing that point.

If you disagree with that patch's content, please reply to it.

P.

> 
> Regards,
> 
> Tvrtko
> 
> > Thx
> > P.
> > 
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > > Signed-off-by: Philipp Stanner <pha...@kernel.org>
> > > > ---
> > > >   drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
> > > >   include/drm/gpu_scheduler.h            |  9 +++++++
> > > >   2 files changed, 30 insertions(+), 13 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index d20726d7adf0..3f14f1e151fa 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > > >   }
> > > >   EXPORT_SYMBOL(drm_sched_init);
> > > >   
> > > > +static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	struct drm_sched_job *job, *tmp;
> > > > +
> > > > +	/* All other accessors are stopped. No locking necessary. */
> > > > +	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
> > > > +		sched->ops->cancel_job(job);
> > > > +		list_del(&job->list);
> > > > +		sched->ops->free_job(job);
> > > > +	}
> > > > +}
> > > > +
> > > >   /**
> > > >    * drm_sched_fini - Destroy a gpu scheduler
> > > >    *
> > > > @@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init);
> > > >    *
> > > >    * Tears down and cleans up the scheduler.
> > > >    *
> > > > - * This stops submission of new jobs to the hardware through
> > > > - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
> > > > - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
> > > > - * There is no solution for this currently. Thus, it is up to the driver to make
> > > > - * sure that:
> > > > - *
> > > > - * a) drm_sched_fini() is only called after for all submitted jobs
> > > > - *    drm_sched_backend_ops.free_job() has been called or that
> > > > - * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
> > > > - *    after drm_sched_fini() ran are freed manually.
> > > > - *
> > > > - * FIXME: Take care of the above problem and prevent this function from leaking
> > > > - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> > > > + * This stops submission of new jobs to the hardware through &struct
> > > > + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
> > > > + * is implemented, all jobs will be canceled through it and afterwards cleaned
> > > > + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
> > > > + * implemented, memory could leak.
> > > >    */
> > > >   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > > >   {
> > > > @@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > > >   	/* Confirm no work left behind accessing device structures */
> > > >   	cancel_delayed_work_sync(&sched->work_tdr);
> > > >   
> > > > +	/* Avoid memory leaks if supported by the driver. */
> > > > +	if (sched->ops->cancel_job)
> > > > +		drm_sched_kill_remaining_jobs(sched);
> > > > +
> > > >   	if (sched->own_submit_wq)
> > > >   		destroy_workqueue(sched->submit_wq);
> > > >   	sched->ready = false;
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index e62a7214e052..81dcbfc8c223 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -512,6 +512,15 @@ struct drm_sched_backend_ops {
> > > >   	 * and it's time to clean it up.
> > > >   	 */
> > > >   	void (*free_job)(struct drm_sched_job *sched_job);
> > > > +
> > > > +	/**
> > > > +	 * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
> > > > +	 * get signaled in drm_sched_fini().
> > > > +	 *
> > > > +	 * Drivers need to signal the passed job's hardware fence with
> > > > +	 * -ECANCELED in this callback. They must not free the job.
> > > > +	 */
> > > > +	void (*cancel_job)(struct drm_sched_job *sched_job);
> > > >   };
> > > >   
> > > >   /**
> > 
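
To make the expected callback contract concrete, below is a minimal sketch
of what a driver-side cancel_job() implementation could look like. The
foo_* names and the hw_fence member are made up for illustration only;
real drivers keep their hardware fence wherever they track in-flight
submissions:

/*
 * Hypothetical driver job wrapping the scheduler job. The embedded
 * hw_fence is the fence returned by run_job() and signaled by the
 * hardware on completion.
 */
struct foo_job {
	struct drm_sched_job base;
	struct dma_fence hw_fence;
};

static void foo_cancel_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = container_of(sched_job, struct foo_job, base);

	/* Tell the hardware / firmware to stop processing the job, if needed. */
	foo_hw_abort(job);

	/*
	 * Set the error before signaling, then signal the hardware fence.
	 * drm_sched_fini() will afterwards clean the job up through
	 * free_job(); cancel_job() must not free it itself.
	 */
	dma_fence_set_error(&job->hw_fence, -ECANCELED);
	dma_fence_signal(&job->hw_fence);
}

static const struct drm_sched_backend_ops foo_sched_ops = {
	.run_job	= foo_run_job,
	.timedout_job	= foo_timedout_job,
	.free_job	= foo_free_job,
	.cancel_job	= foo_cancel_job,
};

The key point is that cancel_job() only signals the fence (with the error
set beforehand via dma_fence_set_error()); freeing the job remains the
responsibility of free_job(), which the scheduler invokes right afterwards
in drm_sched_kill_remaining_jobs().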