On 22/10/2025 15:03, Danilo Krummrich wrote:
On Wed Oct 22, 2025 at 3:50 PM CEST, Tvrtko Ursulin wrote:
Yes, for the case when entity joins the run-queue it can be the same
entity which is now the head of the queue, or it can be a different one.
Depends on the insertion position.
But for the case where entity is leaving the run queue it is always a
different entity and therefore a lock inversion. We have essentially this:
lock entity
lock rq
remove entity from the rq
rq->prio = rq->head_entity->prio // different entity, unlocked read
unlock rq
unlock entity
This sounds like it repeates the unclear locking situation that is also
documented for struct drm_sched_rq:
* FIXME: Locking is very unclear for this. Writers are protected by
* @lock, but readers are generally lockless and seem to just race with
* not even a READ_ONCE.
This sounds pretty suspicious to me and I think it indicates a more fundamental
design issue that you now end up working around now.
I'm afraid it is not nearly the same. Guarantee that entity->rq is
stable is a multi-step one which depends on the job queue being non
empty and the last submitted job not being signalled. That side even
includes a smp_rmb() in drm_sched_entity_select_rq(). Code which does
the suspicious unlocked entity->rq access therefore claims to be certain
one or both of those conditions must be true.
What I am doing here is way, way simpler and IMO should not
controversial. It is well defined that entities can only enter and exit
the run queue with the rq->lock held. Which the code path holds, and the
functions asserts for. So a lockless read of an integer is nowhere near
the complexities of the FIXME you quote.
I'd like to dig in a bit more, but unfortunately it's very unlikely I will have
the time to do this until after LPC.
Should I interpret this as putting a blocker on the series until
effectively 2026?
Regards,
Tvrtko