On Tue, 2025-04-22 at 13:07 +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > The question I raised is whether there are other drivers which
> > > manage to clean up everything correctly (like the mock scheduler
> > > does), but trigger that warning. Maybe there are not and maybe the
> > > mock scheduler is the only false positive.
> > 
> > So far the scheduler simply does not give any guideline on how to
> > address the problem, hence every driver simply does something (or
> > nothing, effectively ignoring the problem). This is what we want to
> > fix.
> > 
> > The mock scheduler keeps its own list of pending jobs and on tear
> > down stops the scheduler's workqueues, traverses its own list and
> > eventually frees the pending jobs without updating the scheduler's
> > internal pending list.
> > 
> > So yes, it does avoid memory leaks, but it also leaves the
> > scheduler's internal structures in an invalid state, i.e. the
> > pending list of the scheduler has pointers to already freed memory.
> > 
> > What if drm_sched_fini() starts touching the pending list? Then
> > you'd end up with UAF bugs with this implementation. We cannot
> > invalidate the scheduler's internal structures and yet call
> > scheduler functions - e.g. drm_sched_fini() - subsequently.
> > 
> > Hence, the current implementation of the mock scheduler is
> > fundamentally flawed. And so would be *every* driver that still has
> > entries within the scheduler's pending list.
> > 
> > This is not a false positive, it already caught a real bug -- in
> > the mock scheduler.
> 
> To avoid further splitting hairs on whether real bugs need to be able
> to manifest or not, let's move past this with the conclusion that
> there are two potential things to do here:
> 
> First one is to either send separately or include in this series
> something like:
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  		drm_mock_sched_job_complete(job);
>  	spin_unlock_irqrestore(&sched->lock, flags);
>  
> +	drm_sched_fini(&sched->base);
> +
>  	/*
>  	 * Free completed jobs and jobs not yet processed by the DRM scheduler
>  	 * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  
>  	list_for_each_entry_safe(job, next, &list, link)
>  		mock_sched_free_job(&job->base);
> -
> -	drm_sched_fini(&sched->base);
>  }
>  
>  /**
> 
> That should satisfy the requirement to "clear" memory about to be
> freed and be 100% compliant with the drm_sched_fini() kerneldoc
> (guideline b).
> 
> But the new warning from 3/5 here will still be there AFAICT, and
> would you then agree it is a false positive?
> 
> Secondly, the series should modify all drivers (including the unit
> tests) which are known to trigger this false positive.
There is no false positive; I just stated that in my other reply.

That said, I agree that the unit tests should be a second user of this
code, in addition to Nouveau. They should then also provide the
fence_context_kill() callback, though.

And it seems we also agree that the memory leaks must be solved
centrally in drm_sched_fini().

P.

> 
> Regards,
> 
> Tvrtko
> 
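
---

As an illustration of the fence_context_kill() idea mentioned above,
here is a minimal sketch of what such a callback could look like for
the mock scheduler. The callback's actual name and signature are not
shown in this thread, and the "job_list" field name is an assumption;
only the locking, the "base" member, and the
drm_mock_sched_job_complete() call are taken from the quoted diff.

	/*
	 * Hypothetical sketch: signature and job_list are assumed, not
	 * taken from the series under discussion.
	 */
	static void
	mock_sched_fence_context_kill(struct drm_gpu_scheduler *gsched)
	{
		/* "base" is the embedded drm_gpu_scheduler, as in the diff. */
		struct drm_mock_scheduler *sched =
			container_of(gsched, struct drm_mock_scheduler, base);
		struct drm_mock_sched_job *job, *next;
		unsigned long flags;

		/*
		 * Signal the hardware fence of every job the mock scheduler
		 * still tracks, so that drm_sched_fini() can subsequently
		 * free the pending list centrally instead of the driver
		 * freeing jobs behind the scheduler's back.
		 */
		spin_lock_irqsave(&sched->lock, flags);
		list_for_each_entry_safe(job, next, &sched->job_list, link)
			drm_mock_sched_job_complete(job);
		spin_unlock_irqrestore(&sched->lock, flags);
	}

Wired up as such a callback, this would let drm_sched_fini() observe
all fences as signalled and reclaim the pending list itself, which is
the central cleanup both sides appear to agree on above.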