On 17/11/2025 10:26, Boris Brezillon wrote:
> On Mon, 17 Nov 2025 09:49:27 +0000
> Steven Price <[email protected]> wrote:
>
>> On 12/11/2025 12:47, Boris Brezillon wrote:
>>> On Fri, 7 Nov 2025 16:40:53 +0000
>>> Steven Price <[email protected]> wrote:
>>>
>>>> On 31/10/2025 15:48, Boris Brezillon wrote:
>>>>> A group can become runnable even after reset.in_progress has
>>>>> been set to true and panthor_sched_suspend() has been called,
>>>>> because the drm_sched queues are still running at that point,
>>>>> and ::run_job() might call group_schedule_locked() which moves
>>>>> the group to the runnable list. And that's fine, because we're
>>>>> moving those groups to the stopped list anyway when we call
>>>>> panthor_group_stop(), so just drop the misleading WARN_ON().
>>>>
>>>> If we've got another thread mutating the runnable list between
>>>> panthor_sched_suspend() and list_for_each_entry_safe(), doesn't
>>>> that make the list iterator unsafe? (_safe only protects against
>>>> deleting the current item, not against concurrent access.)
>>>
>>> I'm not too sure actually. There's an
>>> atomic_read(&sched->reset.in_progress) to check if we're about to
>>> reset in group_schedule_locked() and cancel the insertion into the
>>> runnable list in that case, meaning we're sure nothing new will be
>>> inserted after we've set in_progress=true in
>>> panthor_sched_pre_reset().
>>
>> I was mostly going on your commit message:
>>
>>> A group can become runnable even after reset.in_progress has
>>> been set to true and panthor_sched_suspend() has been called
>>
>> if that is indeed happening then we have a problem (and removing the
>> WARN_ON is just papering over it). I haven't actually followed
>> through the logic.
>
> Sorry, it's not exactly that. The problem is that a group might be
> inserted in the runnable list before we've had a chance to set
> reset.in_progress=true (earlier in this function), and there's
> nothing removing those groups from the runnable list between this
> assignment and the loop stopping the groups.
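Ah, I see - so the window is something like this (a heavily simplified
sketch from memory, not the actual code - the field names may well be
off, so treat it as pseudo-code):

	/* panthor_sched_pre_reset(), roughly: */
	mutex_lock(&sched->reset.lock);

	/*
	 * A group inserted into the runnable list just before this
	 * assignment stays there: nothing between here and the loop
	 * below takes it off the list again.
	 */
	atomic_set(&sched->reset.in_progress, true);

	cancel_work_sync(&sched->sync_upd_work);
	cancel_delayed_work_sync(&sched->tick_work);

	panthor_sched_suspend(ptdev);

	for (i = 0; i < ARRAY_SIZE(sched->groups.runnable); i++) {
		/* ... which is exactly what trips this: */
		drm_WARN_ON(&ptdev->base,
			    !list_empty(&sched->groups.runnable[i]));
		list_for_each_entry_safe(group, tmp,
					 &sched->groups.runnable[i],
					 run_node)
			panthor_group_stop(group);
	}
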
Ok, so the commit message is misleading... My understanding is that
the tick and sync_upd works are what move things onto the runnable
list. So we set the "reset.in_progress" flag and cancel (and sync)
the two work queues which could move things onto runnable.
panthor_sched_suspend() then deals with anything actually running,
but it does appear we're missing anything to deal with the runnable
list, hence the WARN_ON needs dropping.

>>>> It feels to me like we should be holding the sched mutex - at
>>>> least while iterating. I agree the WARN_ON is unnecessary, and
>>>> will need removing if we simply guard the iteration - the
>>>> alternative is to recolour panthor_sched_suspend() to assume the
>>>> lock is held (and take the lock in panthor_sched_pre_reset), but
>>>> I suspect that's a more ugly change.
>>>
>>> I'd rather ensure that nothing new is inserted in the
>>> runnable/idle lists after sched->reset.in_progress is set to true.
>>> Note that sched->reset.in_progress is set to true with the sched
>>> lock held, meaning any path modifying the sched lists (which must
>>> be done with the sched lock held) should complete before we set
>>> this to true. As long as those paths also skip the list insertion,
>>> or, for things happening in a work context (thinking of the tick
>>> work here), as long as the work is not rescheduled until we get a
>>> chance to disable it, we should be good, no?
>>
>> Yes, that design can work. But atomics can be difficult to reason
>> about, so unless there's a good reason I think we'd generally be
>> better sticking with (simple) locks
>
> Locks alone won't prevent groups from being moved around after the
> stop_all_groups loop though. It's the lock plus the fact that groups
> can't have their state changed while a reset is in progress that
> gives this guarantee, at which point I guess checking
> reset.in_progress with or without the lock held is the same, no? We
> do change the reset.in_progress state with the sched.reset.lock held
> though, to make sure any path that could move the group to a
> different list is out of the way when we exit the locked section.
> That means that new threads entering such paths (paths changing the
> group state) will see the new value and bail out.

No, you're right - the lock by itself won't work, because we need to
drop all locks before returning from pre_reset() and we need to keep
that state until post_reset().

>> on the slow paths, then we get the benefits of lockdep etc.
>> checking we haven't messed up.
>
> I don't mind taking the sched lock in this slow path, but I don't
> think taking/releasing it inside panthor_sched_pre_reset() provides
> any more safety, because what we want is a guarantee that groups
> won't be moved around between panthor_sched_pre_reset() and
> panthor_sched_post_reset(). If that's really a change we want to
> push (reworking the locking in the reset sequence), I'd rather do
> that in its own patchset, if you don't mind.

I think the real issue was the commit message misleading me. I'd
assumed from the commit message that you'd observed a group being put
on the runnable list between the call to panthor_sched_suspend() and
the drm_WARN_ON() check. If that really did happen then there's a
much deeper bug somewhere that needs fixing.

If I understand correctly, what you've actually observed is a group
being put on the runnable list *before* panthor_sched_suspend(), and
we're not currently handling that[1]. I agree that simply dropping
the drm_WARN_ON() would be the logical fix for that.
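
IOW, after this patch, the thing we end up relying on is just (again a
rough sketch, not the actual code):

	/*
	 * group_schedule_locked(), roughly - called with the sched
	 * lock held, so it either completes before pre_reset() sets
	 * the flag (and the stop-all-groups loop then picks the group
	 * up off the runnable list), or it sees the flag and bails
	 * out before touching the lists:
	 */
	if (atomic_read(&sched->reset.in_progress))
		return false;

	list_move_tail(&group->run_node,
		       &sched->groups.runnable[group->priority]);
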
TL;DR: The commit message needs some work ;)

Thanks,
Steve

[1] Well, we are, but with a big WARN_ON splat ;)
