Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider: - [High] Data race and lost updates on `group->fatal_queues` due to split locking. - [Medium] Data race on `group->timedout` boolean flag due to split locking. - [Medium] Data race on `sched->might_have_idle_groups` due to split locking. --
commit db6f3f8002805cdf65efff6be24738f26e5f60a4 Author: Boris Brezillon <[email protected]> drm/panthor: Protect events processing with a separate spinlock This commit introduces a new events_lock spinlock in the panthor scheduler. It aims to protect event processing and command stream group (CSG) slot updates, allowing some event processing to be moved to a threaded IRQ handler. > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c > b/drivers/gpu/drm/panthor/panthor_sched.c > index 5b34032deff81..4f5d874e5d468 100644 > --- a/drivers/gpu/drm/panthor/panthor_sched.c > +++ b/drivers/gpu/drm/panthor/panthor_sched.c [ ... ] > @@ -1484,7 +1497,7 @@ cs_slot_process_fatal_event_locked(struct > panthor_device *ptdev, > u32 fatal; > u64 info; > > - lockdep_assert_held(&sched->lock); > + lockdep_assert_held(&sched->events_lock); > > cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id); > fatal = cs_iface->output->fatal; [Severity: High] With this change, cs_slot_process_fatal_event_locked() now executes with only events_lock held, which means the subsequent update to group->fatal_queues inside this function is no longer protected by sched->lock: group->fatal_queues |= BIT(cs_id); However, in tick_ctx_init(), we update this same field while holding sched->lock, but after dropping the scoped events_lock: if (!group->fatal_queues) group->fatal_queues |= GENMASK(group->queue_count - 1, 0); Could this result in a data race and lost updates on group->fatal_queues if these two paths execute concurrently on different CPUs? If the non-atomic read-modify-write operations overlap, one of the updates might be lost, potentially leaving a faulty queue unflagged. This is a pre-existing issue, but it's worth noting that group_process_tiler_oom() also performs a lockless read-modify-write on group->fatal_queues: group->fatal_queues |= BIT(cs_id); [ ... ] > @@ -1731,7 +1744,7 @@ static void csg_slot_process_idle_event_locked(struct > panthor_device *ptdev, u32 > { > struct panthor_scheduler *sched = ptdev->scheduler; > > - lockdep_assert_held(&sched->lock); > + lockdep_assert_held(&sched->events_lock); > > sched->might_have_idle_groups = true; > [Severity: Medium] Will this lead to a data race on sched->might_have_idle_groups? csg_slot_process_idle_event_locked() sets this flag to true while holding events_lock. But in tick_ctx_apply(), the flag is overwritten under sched->lock without holding events_lock: sched->might_have_idle_groups = ctx->idle_group_count > 0; Because these paths do not share a common lock, they can execute concurrently, which violates the C memory model and might trigger KCSAN warnings. [ ... ] > @@ -1763,7 +1776,7 @@ csg_slot_process_progress_timer_event_locked(struct > panthor_device *ptdev, u32 c > struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id]; > struct panthor_group *group = csg_slot->group; > > - lockdep_assert_held(&sched->lock); > + lockdep_assert_held(&sched->events_lock); > > group = csg_slot->group; > if (!drm_WARN_ON(&ptdev->base, !group)) { [Severity: Medium] Since csg_slot_process_progress_timer_event_locked() now holds events_lock instead of sched->lock, is it still safe to modify group->timedout here? group->timedout = true; Other paths like panthor_sched_suspend() and panthor_job_timedout() modify and read group->timedout while holding only sched->lock: csg_slot->group->timedout = true; Concurrent writes from the event processing workqueue and the suspend/timeout paths could cause a data race on this boolean flag. -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=4
