On Wed Oct 22, 2025 at 4:24 PM CEST, Tvrtko Ursulin wrote:
>
> On 22/10/2025 15:03, Danilo Krummrich wrote:
>> On Wed Oct 22, 2025 at 3:50 PM CEST, Tvrtko Ursulin wrote:
>>> Yes, for the case when entity joins the run-queue it can be the same
>>> entity which is now the head of the queue, or it can be a different one.
>>> Depends on the insertion position.
>>>
>>> But for the case where entity is leaving the run queue it is always a
>>> different entity and therefore a lock inversion. We have essentially this:
>>>
>>> lock entity
>>> lock rq
>>> remove entity from the rq
>>> rq->prio = rq->head_entity->prio // different entity, unlocked read
>>> unlock rq
>>> unlock entity
>> 
>> This sounds like it repeates the unclear locking situation that is also
>> documented for struct drm_sched_rq:
>> 
>>       * FIXME: Locking is very unclear for this. Writers are protected by
>>       * @lock, but readers are generally lockless and seem to just race with
>>       * not even a READ_ONCE.
>> 
>> This sounds pretty suspicious to me and I think it indicates a more 
>> fundamental
>> design issue that you now end up working around now.
>
> I'm afraid it is not nearly the same. Guarantee that entity->rq is 
> stable is a multi-step one which depends on the job queue being non 
> empty and the last submitted job not being signalled. That side even 
> includes a smp_rmb() in drm_sched_entity_select_rq(). Code which does 
> the suspicious unlocked entity->rq access therefore claims to be certain 
> one or both of those conditions must be true.
>
> What I am doing here is way, way simpler and IMO should not 
> controversial. It is well defined that entities can only enter and exit 
> the run queue with the rq->lock held. Which the code path holds, and the 
> functions asserts for. So a lockless read of an integer is nowhere near 
> the complexities of the FIXME you quote.

What I'm saying is that the pattern is the same, the writer side protects an
entity field with a lock, whereas the reader side does not.

And the fact that this is done for locking inversion reasons indicates a design
issue -- not necessarily with your code, but maybe with the existing code.

>> I'd like to dig in a bit more, but unfortunately it's very unlikely I will 
>> have
>> the time to do this until after LPC.
>
> Should I interpret this as putting a blocker on the series until 
> effectively 2026?

It means that I won't have time to help by digging in and see if it's in fact an
issue you inherit from the existing code.

Reply via email to