On Wed Oct 22, 2025 at 4:24 PM CEST, Tvrtko Ursulin wrote: > > On 22/10/2025 15:03, Danilo Krummrich wrote: >> On Wed Oct 22, 2025 at 3:50 PM CEST, Tvrtko Ursulin wrote: >>> Yes, for the case when entity joins the run-queue it can be the same >>> entity which is now the head of the queue, or it can be a different one. >>> Depends on the insertion position. >>> >>> But for the case where entity is leaving the run queue it is always a >>> different entity and therefore a lock inversion. We have essentially this: >>> >>> lock entity >>> lock rq >>> remove entity from the rq >>> rq->prio = rq->head_entity->prio // different entity, unlocked read >>> unlock rq >>> unlock entity >> >> This sounds like it repeates the unclear locking situation that is also >> documented for struct drm_sched_rq: >> >> * FIXME: Locking is very unclear for this. Writers are protected by >> * @lock, but readers are generally lockless and seem to just race with >> * not even a READ_ONCE. >> >> This sounds pretty suspicious to me and I think it indicates a more >> fundamental >> design issue that you now end up working around now. > > I'm afraid it is not nearly the same. Guarantee that entity->rq is > stable is a multi-step one which depends on the job queue being non > empty and the last submitted job not being signalled. That side even > includes a smp_rmb() in drm_sched_entity_select_rq(). Code which does > the suspicious unlocked entity->rq access therefore claims to be certain > one or both of those conditions must be true. > > What I am doing here is way, way simpler and IMO should not > controversial. It is well defined that entities can only enter and exit > the run queue with the rq->lock held. Which the code path holds, and the > functions asserts for. So a lockless read of an integer is nowhere near > the complexities of the FIXME you quote.
What I'm saying is that the pattern is the same, the writer side protects an entity field with a lock, whereas the reader side does not. And the fact that this is done for locking inversion reasons indicates a design issue -- not necessarily with your code, but maybe with the existing code. >> I'd like to dig in a bit more, but unfortunately it's very unlikely I will >> have >> the time to do this until after LPC. > > Should I interpret this as putting a blocker on the series until > effectively 2026? It means that I won't have time to help by digging in and see if it's in fact an issue you inherit from the existing code.
