On 13.08.25 10:56, Philipp Stanner wrote: > In drm_sched_fini() all entities are marked as stopped - without taking > the appropriate lock, because that would deadlock. That means that > drm_sched_fini() and drm_sched_entity_push_job() can race against each > other. > > This should most likely be fixed by establishing the rule that all > entities associated with a scheduler must be torn down first. Then, > however, the locking should be removed from drm_sched_fini() alltogether > with an appropriate comment. > > Reported-by: James Flowers <bold.zone2...@fastmail.com> > Link: > https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2...@fastmail.com/ > Signed-off-by: Philipp Stanner <pha...@kernel.org>
Reviewed-by: Christian König <christian.koe...@amd.com> > --- > Changes in v2: > - Fix typo. > --- > drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > b/drivers/gpu/drm/scheduler/sched_main.c > index 5a550fd76bf0..46119aacb809 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) > * Prevents reinsertion and marks job_queue as idle, > * it will be removed from the rq in > drm_sched_entity_fini() > * eventually > + * > + * FIXME: > + * This lacks the proper spin_lock(&s_entity->lock) and > + * is, therefore, a race condition. Most notably, it > + * can race with drm_sched_entity_push_job(). The lock > + * cannot be taken here, however, because this would > + * lead to lock inversion -> deadlock. > + * > + * The best solution probably is to enforce the life > + * time rule of all entities having to be torn down > + * before their scheduler. Then, however, locking could > + * be dropped alltogether from this function. > + * > + * For now, this remains a potential race in all > + * drivers that keep entities alive for longer than > + * the scheduler. > */ > s_entity->stopped = true; > spin_unlock(&rq->lock);