On 22/10/2025 07:34, Philipp Stanner wrote:
In a past bug fix it was forgotten that entity access must be protected
by the entity lock. That's a data race and potentially UB.

Move the spin_unlock() to the appropriate position.

Cc: [email protected] # v5.13+
Fixes: ac4eb83ab255 ("drm/sched: select new rq even if there is only one v3")
Signed-off-by: Philipp Stanner <[email protected]>
---
  drivers/gpu/drm/scheduler/sched_entity.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 5a4697f636f2..aa222166de58 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -552,10 +552,11 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
 	}
-	spin_unlock(&entity->lock);
 
 	if (entity->num_sched_list == 1)
 		entity->sched_list = NULL;
+
+	spin_unlock(&entity->lock);
 }
 
 /**

Agreed, the unlock at the end is correct.

Reviewed-by: Tvrtko Ursulin <[email protected]>

Luckily only amdgpu calls drm_sched_entity_modify_sched(), and I am not sure whether a hypothetical multi-threaded submit could actually hit this or not.

One thread would have to clear the sched_list while another has already passed the !sched_list check at the beginning of the function, as well as the job_queue count and last_scheduled-completed guards. That would mean the job has to complete between the unlock and the clearing of the pointer.
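
To make the window concrete, here is a minimal userspace sketch of the ordering in question (plain pthreads, not the scheduler code itself; fake_entity, select_rq_buggy(), select_rq_fixed() and reader() are made-up names for illustration). The buggy variant mirrors the old code by clearing the pointer after dropping the lock, the fixed one mirrors the patch:

/* Illustration only -- compile with: gcc -pthread race_sketch.c */
#include <pthread.h>
#include <stdio.h>

struct fake_entity {
	pthread_mutex_t lock;
	void *sched_list;		/* stands in for entity->sched_list */
	unsigned int num_sched_list;
};

/* Old ordering: the pointer is cleared after the lock has been dropped. */
static void select_rq_buggy(struct fake_entity *e)
{
	pthread_mutex_lock(&e->lock);
	/* ... rq selection would happen here ... */
	pthread_mutex_unlock(&e->lock);

	/* race window: a concurrent reader can observe either value */
	if (e->num_sched_list == 1)
		e->sched_list = NULL;
}

/* Fixed ordering, mirroring the patch: clear before unlocking. */
static void select_rq_fixed(struct fake_entity *e)
{
	pthread_mutex_lock(&e->lock);
	/* ... rq selection would happen here ... */
	if (e->num_sched_list == 1)
		e->sched_list = NULL;
	pthread_mutex_unlock(&e->lock);
}

/* Stands in for a path that checks sched_list under the lock. */
static void *reader(void *arg)
{
	struct fake_entity *e = arg;

	pthread_mutex_lock(&e->lock);
	printf("sched_list is %s\n", e->sched_list ? "set" : "NULL");
	pthread_mutex_unlock(&e->lock);
	return NULL;
}

int main(void)
{
	static int dummy;
	struct fake_entity e = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.sched_list = &dummy,
		.num_sched_list = 1,
	};
	pthread_t t;

	pthread_create(&t, NULL, reader, &e);
	select_rq_fixed(&e);	/* swap in select_rq_buggy() for the old ordering */
	pthread_join(&t, NULL);
	return 0;
}

With the fixed ordering a reader holding the lock can never see the list being cleared outside of it, which is exactly what moving the spin_unlock() achieves.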

Also, please take a look at this when you can:

https://lore.kernel.org/dri-devel/[email protected]/

That would also have fixed this issue by removing the clearing of entity->sched_list altogether, the latter being just a hack to work around confused container ownership rules.

Regards,

Tvrtko
