On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumer...@gmail.com> wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
>
> Signed-off-by: Christian König <christian.koe...@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  
>       drm_sched_wqueue_stop(sched);
>  
> +     /*
> +      * Tearing down the scheduler wile there are still unprocessed jobs can
> +      * lead to use after free issues in the scheduler fence.
> +      */
> +     WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.

BR,
Jani.

> +
>       for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>               struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>               spin_lock(&rq->lock);
> -             list_for_each_entry(s_entity, &rq->entities, list)
> +             list_for_each_entry(s_entity, &rq->entities, list) {
> +                     /*
> +                      * The justification for this BUG_ON() is that tearing
> +                      * down the scheduler while jobs are pending leaves
> +                      * dma_fences unsignaled. Since we have dependencies
> +                      * from the core memory management to eventually signal
> +                      * dma_fences this can trivially lead to a system wide
> +                      * stop because of a locked up memory management.
> +                      */
> +                     BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>                       /*
>                        * Prevents reinsertion and marks job_queue as idle,
>                        * it will removed from rq in drm_sched_entity_fini
>                        * eventually
>                        */
>                       s_entity->stopped = true;
> +             }
>               spin_unlock(&rq->lock);
>               kfree(sched->sched_rq[i]);
>       }

-- 
Jani Nikula, Intel

Reply via email to