On 3/13/26 20:35, Konstantin Khorenko wrote:
> The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is started by
> dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement.
> Its callback sched_cfs_active_timer() dereferences cfs_rq->tg.
>
> When a task group is destroyed, unregister_fair_sched_group() tears
> down the per-CPU cfs_rq structures but never cancels the active_timer.
> If the timer fires after the cfs_rq and task_group memory have been
> freed and reallocated (all three — cfs_rq, sched_entity, and
> task_group — live in the kmalloc-1k slab cache), the callback performs
> atomic_dec() on an arbitrary kernel address, corrupting memory.
>
> This was observed as a hard lockup + NULL-pointer Oops in
> enqueue_task_fair() during task wakeup: the se->parent pointer at
> offset 128 in a sched_entity was corrupted because it shares the same
> slab offset as cfs_rq->skip (also offset 128) — a classic cross-type
> UAF in a shared slab cache.
>
> Fix this in two ways:
>
> 1. Cancel the active_timer in unregister_fair_sched_group() before the
>    cfs_rq is freed. The cancel must happen outside the rq lock because
>    the timer callback sched_cfs_active_timer() acquires it.
>
> 2. Move the atomic_dec(&cfs_rq->tg->nr_cpus_active) inside the rq lock
>    in sched_cfs_active_timer(). In the original code the callback
>    releases the rq lock before executing atomic_dec, creating a window
>    where the teardown path can run between the unlock and the
>    atomic_dec:
>
>    CPU B (timer callback)            CPU A (teardown)
>    ──────────────────────            ────────────────
>    sched_cfs_active_timer()
>      raw_spin_rq_lock()
>      cfs_rq->active = ...
>      raw_spin_rq_unlock()
>      ← lock released, atomic_dec
>        not yet executed
>                                      unregister_fair_sched_group()
>                                        raw_spin_rq_lock()
>                                        list_del_leaf_cfs_rq()
This explanation and the race diagram seem incorrect: the first part of
the fix already adds hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer)
before list_del_leaf_cfs_rq(), and hrtimer_cancel() waits for the timer
callback sched_cfs_active_timer() to finish before teardown can continue
on to free the tg. So the second part of the fix is redundant.

>                                        raw_spin_rq_unlock()
>                                      sched_free_group()
>                                        kfree(tg)
>      atomic_dec(&cfs_rq->tg->...)  ← UAF! tg already freed
>
>    With atomic_dec inside the rq lock, teardown's raw_spin_rq_lock()
>    in unregister_fair_sched_group() cannot proceed until the callback
>    has completed all accesses to cfs_rq->tg, eliminating the window.
>
>    Note: hrtimer_cancel() (fix #1) also independently closes this
>    race — the hrtimer core keeps base->running == timer until fn()
>    returns, so hrtimer_cancel() waits for the callback to fully
>    complete, including atomic_dec. Moving atomic_dec under the lock
>    makes the serialization explicit via the lock protocol without
>    relying on hrtimer internals.
>
> https://virtuozzo.atlassian.net/browse/VSTOR-126785
>
> Signed-off-by: Konstantin Khorenko <[email protected]>
>
> Feature: sched: ability to limit number of CPUs available to a CT
> ---
>  kernel/sched/fair.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cc5dceb9c815f..9b0fe4c8a272f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -597,9 +597,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
>
>  	raw_spin_rq_lock_irqsave(rq, flags);
>  	cfs_rq->active = !list_empty(&cfs_rq->tasks);
> -	raw_spin_rq_unlock_irqrestore(rq, flags);
> -
>  	atomic_dec(&cfs_rq->tg->nr_cpus_active);
> +	raw_spin_rq_unlock_irqrestore(rq, flags);
>
>  	return HRTIMER_NORESTART;
>  }
> @@ -13020,6 +13019,16 @@ void unregister_fair_sched_group(struct task_group *tg)
>  	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
>  	for_each_possible_cpu(cpu) {
> +#ifdef CONFIG_CFS_CPULIMIT
> +		/*
> +		 * Cancel the per-cfs_rq active timer before freeing.
> +		 * The callback dereferences cfs_rq->tg, so failing to
> +		 * cancel leads to use-after-free once the tg is freed.
> +		 * Must be done outside the rq lock since the callback
> +		 * acquires it.
> +		 */
> +		hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
> +#endif
>  		if (tg->se[cpu])
>  			remove_entity_load_avg(tg->se[cpu]);

-- 
Best regards,
Pavel Tikhomirov
Senior Software Developer, Virtuozzo.

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel
