On 22/10/2025 07:34, Philipp Stanner wrote:
A past bug fix overlooked that entity access must be protected by the
entity lock. The unprotected access is a data race and potentially UB.

Move the spin_unlock() to the appropriate position.
Cc: [email protected] # v5.13+
Fixes: ac4eb83ab255 ("drm/sched: select new rq even if there is only one v3")
Signed-off-by: Philipp Stanner <[email protected]>
---
drivers/gpu/drm/scheduler/sched_entity.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 5a4697f636f2..aa222166de58 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -552,10 +552,11 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
 	}
-	spin_unlock(&entity->lock);
 
 	if (entity->num_sched_list == 1)
 		entity->sched_list = NULL;
+
+	spin_unlock(&entity->lock);
 }
 
 /**
Agreed, unlock at the end is correct.
Reviewed-by: Tvrtko Ursulin <[email protected]>
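To spell out why: with this applied, the whole update sits inside a single
critical section, roughly like below (paraphrased from memory rather than
copied verbatim from the tree, with the rq selection elided):

	spin_lock(&entity->lock);

	/* ... pick the best scheduler / rq for this entity ... */

	if (rq != entity->rq) {
		drm_sched_rq_remove_entity(entity->rq, entity);
		entity->rq = rq;
	}

	if (entity->num_sched_list == 1)
		entity->sched_list = NULL;

	spin_unlock(&entity->lock);

So on this path both entity->rq and entity->sched_list are only written
with entity->lock held.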
Luckily only amdgpu calls drm_sched_entity_modify_sched(), and I am not
sure whether even a hypothetical multi-threaded submit could actually
hit this.
One thread would have to clear the sched_list, while another would have
to have passed the !sched_list check at the beginning of the function, as
well as the job_queue count and last_scheduled completed guards. That
would mean a job has to complete between the unlock and the clearing of
the pointer.
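Spelled out, the window would be something like this (purely hypothetical,
on the pre-fix code, both threads in drm_sched_entity_select_rq(), so it
would need multi-threaded submit on the same entity; the thread names are
just labels for illustration):

	Thread A                            Thread B
	--------                            --------
	passes !entity->sched_list check
	spin_lock(&entity->lock);
	... updates entity->rq ...
	spin_unlock(&entity->lock);
	                                    passes !entity->sched_list check
	                                    (a job completes right here)
	                                    job_queue count guard passes
	                                    last_scheduled signalled guard passes
	                                    dereferences entity->sched_list
	entity->sched_list = NULL;
	^^^ unlocked write racing with B's read of the pointer

Quite a few things have to line up, so probably only a theoretical concern.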
Also, please take a look at this when you can:
https://lore.kernel.org/dri-devel/[email protected]/
That series would also have fixed this issue by removing the clearing of
entity->sched_list altogether. The latter is just a hack to work around
confused container ownership rules.
Regards,
Tvrtko