On 17/11/2025 10:26, Boris Brezillon wrote:
> On Mon, 17 Nov 2025 09:49:27 +0000
> Steven Price <[email protected]> wrote:
> 
>> On 12/11/2025 12:47, Boris Brezillon wrote:
>>> On Fri, 7 Nov 2025 16:40:53 +0000
>>> Steven Price <[email protected]> wrote:
>>>   
>>>> On 31/10/2025 15:48, Boris Brezillon wrote:  
>>>>> A group can become runnable even after reset.in_progress has
>>>>> been set to true and panthor_sched_suspend() has been called,
>>>>> because the drm_sched queues are still running at that point,
>>>>> and ::run_job() might call group_schedule_locked() which moves
>>>>> the group to the runnable list. And that's fine, because we're
>>>>> moving those groups to the stopped list anyway when we call
>>>>> panthor_group_stop(), so just drop the misleading WARN_ON().    
>>>>
>>>> If we've got another thread mutating the runnable list between
>>>> panthor_sched_suspend() and list_for_each_entry_safe(), doesn't that
>>>> make the list iterator unsafe? (_safe only protects against deleting the
>>>> current item, not against concurrent access).  
>>>
>>> I'm not too sure actually. There's an
>>> atomic_read(&sched->reset.in_progress) to check if we're about to reset
>>> in group_schedule_locked() and cancel the insertion into the runnable
>>> list in that case, meaning we're sure nothing new will be inserted after
>>> we've set the in_progress=true in panthor_sched_pre_reset().  
>>
>> I was mostly going on your commit message:
>>
>>> A group can become runnable even after reset.in_progress has
>>> been set to true and panthor_sched_suspend() has been called  
>>
>> if that is indeed happening then we have a problem (and removing the
>> WARN_ON is just papering over it). I haven't actually followed through
>> the logic.
> 
> Sorry, it's not exactly that. The problem is that a group might be
> inserted in the runnable list before we've had a chance to set
> reset.in_progress=true (earlier in this function), and there's nothing
> removing those groups from the runnable list between this assignment and
> the loop stopping the groups.

Ok, so the commit message is misleading...

My understanding is that the tick and sync_upd works are what move
things onto the runnable list. So we set the "reset.in_progress" flag
and then cancel (and sync with) the two work items that could move
things onto the runnable list.

panthor_sched_suspend() then deals with anything actually running, but
it does appear nothing empties the runnable list before the stop loop,
hence the WARN_ON fires and needs dropping.
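
For the record, the sequence I have in mind is roughly the following (a
sketch reconstructed from memory, so field and function names may not
match the actual panthor_sched_pre_reset() exactly, and I've left out
the idle-list loop):

void panthor_sched_pre_reset(struct panthor_device *ptdev)
{
        struct panthor_scheduler *sched = ptdev->scheduler;
        struct panthor_group *group, *group_tmp;
        u32 i;

        mutex_lock(&sched->reset.lock);
        atomic_set(&sched->reset.in_progress, true);

        /* Cancel (and sync with) the works that can move groups onto
         * the runnable lists; they won't be requeued while in_progress
         * is set.
         */
        cancel_work_sync(&sched->sync_upd_work);
        cancel_delayed_work_sync(&sched->tick_work);

        /* Deal with anything actually running. */
        panthor_sched_suspend(ptdev);

        /* Groups made runnable before in_progress was set are still
         * sitting on the runnable lists at this point, so this WARN_ON
         * can fire spuriously; the stop loop below handles those
         * groups just fine.
         */
        for (i = 0; i < ARRAY_SIZE(sched->groups.runnable); i++) {
                drm_WARN_ON(&ptdev->base,
                            !list_empty(&sched->groups.runnable[i]));
                list_for_each_entry_safe(group, group_tmp,
                                         &sched->groups.runnable[i],
                                         run_node)
                        panthor_group_stop(group);
        }

        mutex_unlock(&sched->reset.lock);
}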

>>
>>>>
>>>> It feels to me like we should be holding the sched mutex - at least
>>>> while iterating. I agree the WARN_ON is unnecessary, and will need
>>>> removing if we simply guard the iteration - the alternative is to
>>>> recolour panthor_sched_suspend() to assume the lock is held (and take
>>>> the lock in panthor_sched_pre_reset), but I suspect that's a more ugly
>>>> change.  
>>>
>>> I'd rather ensure that nothing new is inserted in the runnable/idle
>>> lists after sched->reset.in_progress is set to true. Note that
>>> sched->reset.in_progress is set to true with the sched lock held,
>>> meaning any path modifying the sched lists (must be done with the sched
>>> lock held) should complete before we set this to true. As long as those
>>> paths also skip the list insertion, or, for things happening in a work
>>> context (thinking of the tick work here), as long as the work is not
>>> rescheduled until we get a chance to disable this work, we should be
>>> good, no?  
>>
>> Yes that design can work. But atomics can be difficult to reason about,
>> so unless there's a good reason I think we'd generally be better
>> sticking with (simple) locks
> 
> Locks alone won't prevent groups from being moved around after the
> stop_all_groups loop though. It's the lock plus the fact groups can't
> have their state changed while a reset is in progress that gives this
> guarantee, at which point I guess checking reset.in_progress with or
> without the lock held is the same, no? We do change the
> reset.in_progress state with the sched.reset.lock held though, to make
> sure any path that could move the group to a different list is out of
> the way when we exit the locked section. That means that new threads
> entering such paths (path changing the group state) will see the new
> value and bail out.

No, you're right - the lock by itself won't work, because we need to
drop all locks before returning from pre_reset() while keeping that
state until post_reset().
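
Agreed that it's the flag that does the heavy lifting here. For the
archive, the pattern we're relying on looks something like this (a
hand-written sketch, not the exact group_schedule_locked() code):

/* Group-list mutations are done with the sched lock held. */
static void group_schedule_locked(struct panthor_group *group,
                                  u32 queue_mask)
{
        struct panthor_scheduler *sched = group->ptdev->scheduler;

        /* A reset is pending or in progress: leave the lists alone,
         * panthor_group_stop() will pick the group up from whichever
         * list it's currently on.
         */
        if (atomic_read(&sched->reset.in_progress))
                return;

        list_move_tail(&group->run_node,
                       &sched->groups.runnable[group->priority]);
        /* ... kick the tick work, etc. */
}

As long as the flag flip and the list mutations are serialized by the
same lock, any thread taking that lock afterwards is guaranteed to see
in_progress == true and bail out, which is what keeps the lists stable
until post_reset().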

>> on the slow paths, then we get the benefits
>> of lockdep etc checking we haven't messed up.
> 
> I don't mind taking the sched lock in this slow path, but I don't think
> taking/releasing it inside panthor_sched_pre_reset() is giving any more
> safeness, because what we want is a guarantee that groups won't be
> moved around between panthor_sched_pre_reset() and
> panthor_sched_post_reset(). If that's really a change we want to push
> (reworking the locking in the reset sequence), I'd rather do that in
> its own patchset, if you don't mind.

I think the real issue is that the commit message misled me. It reads
as though you'd observed a group being put on the runnable list between
the call to panthor_sched_suspend() and the drm_WARN_ON() check. If
that really did happen then there would be a much deeper bug somewhere
that needs fixing.

If I understand correctly, what you've actually observed is a group
being put on the runnable list *before* panthor_sched_suspend(), and
we're not currently handling that[1]. I agree that simply dropping the
drm_WARN_ON() is the logical fix for that.
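
i.e. something along these lines (hand-edited, so the context lines
come from my reading of the function and may be slightly off):

 	for (i = 0; i < ARRAY_SIZE(sched->groups.runnable); i++) {
-		/* All groups should be in the idle lists. */
-		drm_WARN_ON(&ptdev->base,
-			    !list_empty(&sched->groups.runnable[i]));
 		list_for_each_entry_safe(group, group_tmp,
 					 &sched->groups.runnable[i],
 					 run_node)
 			panthor_group_stop(group);
 	}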

TL;DR: the commit message needs some work ;)

Thanks,
Steve

[1] Well we are, but with a big WARN_ON splat ;)
