On 22/10/2025 16:32, Boris Brezillon wrote: > On Wed, 22 Oct 2025 15:28:51 +0100 > Steven Price <[email protected]> wrote: > >> On 22/10/2025 15:00, Boris Brezillon wrote: >>> On Wed, 22 Oct 2025 14:36:23 +0100 >>> Steven Price <[email protected]> wrote: >>> >>>> On 22/10/2025 13:37, Boris Brezillon wrote: >>>>> On Wed, 22 Oct 2025 12:30:13 +0200 >>>>> Ketil Johnsen <[email protected]> wrote: >>>>> >>>>>> The function panthor_fw_unplug() will free the FW memory sections. >>>>>> The problem is that there could still be pending FW events which are yet >>>>>> not handled at this point. process_fw_events_work() can in this case try >>>>>> to access said freed memory. >>>>>> >>>>>> This fix introduces a destroyed state for the panthor_scheduler object, >>>>>> and we check for this before processing FW events. >>>>>> >>>>>> Signed-off-by: Ketil Johnsen <[email protected]> >>>>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block") >>>>>> --- >>>>>> v2: >>>>>> - Followed Boris's advice and handle the race purely within the >>>>>> scheduler block (by adding a destroyed state) >>>>>> --- >>>>>> drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++--- >>>>>> 1 file changed, 12 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c >>>>>> b/drivers/gpu/drm/panthor/panthor_sched.c >>>>>> index 0cc9055f4ee52..4996f987b8183 100644 >>>>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c >>>>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c >>>>>> @@ -315,6 +315,13 @@ struct panthor_scheduler { >>>>>> */ >>>>>> struct list_head stopped_groups; >>>>>> } reset; >>>>>> + >>>>>> + /** >>>>>> + * @destroyed: Scheduler object is (being) destroyed >>>>>> + * >>>>>> + * Normal scheduler operations should no longer take place. >>>>>> + */ >>>>>> + bool destroyed; >>>>> >>>>> Do we really need a new field for that? Can't we just reset >>>>> panthor_device::scheduler to NULL early enough in the unplug path? >>>>> I guess it's not that simple if we have works going back to ptdev >>>>> and then dereferencing ptdev->scheduler, but I think it's also >>>>> fundamentally broken to have scheduler works active after the >>>>> scheduler teardown has started, so we might want to add some more >>>>> checks in the work callbacks too. >>>>> >>>>>> }; >>>>>> >>>>>> /** >>>>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct >>>>>> work_struct *work) >>>>>> u32 events = atomic_xchg(&sched->fw_events, 0); >>>>>> struct panthor_device *ptdev = sched->ptdev; >>>>>> >>>>>> - mutex_lock(&sched->lock); >>>>>> + guard(mutex)(&sched->lock); >>>>>> + >>>>>> + if (sched->destroyed) >>>>>> + return; >>>>>> >>>>>> if (events & JOB_INT_GLOBAL_IF) { >>>>>> sched_process_global_irq_locked(ptdev); >>>>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct >>>>>> work_struct *work) >>>>>> sched_process_csg_irq_locked(ptdev, csg_id); >>>>>> events &= ~BIT(csg_id); >>>>>> } >>>>>> - >>>>>> - mutex_unlock(&sched->lock); >>>>>> } >>>>>> >>>>>> /** >>>>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device >>>>>> *ptdev) >>>>>> cancel_delayed_work_sync(&sched->tick_work); >>>>>> >>>>>> mutex_lock(&sched->lock); >>>>>> + sched->destroyed = true; >>>>>> if (sched->pm.has_ref) { >>>>>> pm_runtime_put(ptdev->base.dev); >>>>>> sched->pm.has_ref = false; >>>>> >>>>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work) >>>>> rather than letting the work execute after we've started tearing down >>>>> the scheduler object. >>>>> >>>>> If you follow my suggestion to reset the ptdev->scheduler field, I >>>>> guess something like that would do: >>>>> >>>>> void panthor_sched_unplug(struct panthor_device *ptdev) >>>>> { >>>>> struct panthor_scheduler *sched = ptdev->scheduler; >>>>> >>>>> /* We want the schedu */ >>>>> WRITE_ONCE(*ptdev->scheduler, NULL); >>>>> >>>>> cancel_work_sync(&sched->fw_events_work); >>>>> cancel_delayed_work_sync(&sched->tick_work); >>>>> >>>>> mutex_lock(&sched->lock); >>>>> if (sched->pm.has_ref) { >>>>> pm_runtime_put(ptdev->base.dev); >>>>> sched->pm.has_ref = false; >>>>> } >>>>> mutex_unlock(&sched->lock); >>>>> } >>>>> >>>>> and >>>>> >>>>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 >>>>> events) { >>>>> struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler); >>>>> >>>>> /* Scheduler is not initialized, or it's gone. */ >>>>> if (!sched) >>>>> return; >>>>> >>>>> atomic_or(events, &sched->fw_events); >>>>> sched_queue_work(sched, fw_events); >>>>> } >>>> >>>> Note there's also the path of panthor_mmu_irq_handler() calling >>>> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well >>>> to be safe. >>> >>> This could be hidden behind a panthor_device_get_sched() helper, I >>> guess. Anyway, it's not so much that I'm against the addition of an >>> extra bool, but AFAICT, the problem is not entirely solved, as there >>> could be a pending work that gets executed after sched_unplug() >>> returns, and I adding this bool check just papers over the real bug >>> (which is that we never cancel the fw_event work). >>> >>>> >>>> I agree having an extra bool is ugly, but it easier to reason about than >>>> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will >>>> be regressed in the future. I can't immediately see how to wrap this in >>>> a helper to ensure this is kept correct. >>> >>> Sure, but you're not really catching cases where the work runs after >>> the scheduler component has been unplugged in case someone forgot to >>> cancel some works. I think I'd rather identify those cases with a >>> kernel panic, than a random UAF when the work is being executed. >>> Ultimately, we should probably audit all works used in the driver, to >>> make sure they are properly cancelled at unplug() time by the relevant >>> <component>_unplug() functions. >> >> Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work) >> call somewhere on the unplug path. That needs to be after the job irq >> has been disabled which is currently done in panthor_fw_unplug(). > > Not necessarily. If we prevent any further FW events to queue the > fw_events work, we can just cancel it in the sched_unplug() path, after > we've transition to this "sched-is-gone" state.
True that would also work. Steve
