Roman Gushchin <[email protected]> writes:

> At Thu, 15 May 2014 10:43:14 -0700,
> [email protected] wrote:
>>
>> Roman Gushchin <[email protected]> writes:
>>
>> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to force the
>> > period timer restart. It's not safe, because it can lead to a
>> > deadlock, described in commit 927b54fccbf0:
>> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
>> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
>> > runs for another loop iteration, the hrtimer can attempt to take
>> > rq->lock, resulting in deadlock."
>> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
>> > with the second iteration of sched_cfs_period_timer, the deadlock
>> > occurs.
>>
>> I do not see this deadlock. cfs_b->timer_active is protected by
>> cfs_b->lock on both read and write, and I don't think it would even
>> directly cause deadlock if it wasn't. In particular, this patch makes
>> us run for strictly longer outside of the interrupt (although I think
>> the patched version is also correct). The old issue was calling
>> hrtimer_cancel, which meant we held cfs_b->lock while waiting for the
>> interrupt to complete, while the interrupt was waiting to take
>> cfs_b->lock.
>
> I still think there is a deadlock. I'll try to explain.
> Three CPUs must be involved:
>
> CPU0                     CPU1                      CPU2
> take rq->lock            period timer fired
> ...                      take cfs_b lock
> ...                      ...                       tg_set_cfs_bandwidth()
> throttle_cfs_rq()        release cfs_b lock        take cfs_b lock
> ...                      distribute_cfs_runtime()  timer_active = 0
> take cfs_b->lock         wait for rq->lock         ...
> __start_cfs_bandwidth()
> {wait for timer callback
>  break if timer_active == 1}
>
> So, CPU0 and CPU1 are deadlocked.
Ahhh, yes, there is a three-cpu race here. Could you clarify in the
commit message that this is a race involving all three of
tg_set_cfs_bandwidth, the period timer, and then a second
__start_cfs_bandwidth call? The code looks good, however.

>> I just did notice however that sched_cfs_period_timer reads
>> cfs_b->period without any locks, which can in fact race with an
>> update to period (and isn't fixed by this patch). If this write
>> doesn't happen to be atomic (which is certainly not guaranteed, and
>> is entirely plausible on say 32-bit x86, much less other archs), we
>> could read a partially written value and move the timer incorrectly.
>> Pulling the lock/unlock out of do_sched_cfs_period_timer should fix
>> this easily enough.
>
> Looks like another issue.

Yeah.

> Thanks,
> Roman
>
>>
>> unthrottle_offline_cfs_rqs could cause similar issues with its
>> unlocked read of quota, and can also be easily fixed.
>>
>>
>> >
>> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
>> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
>> > restart the timer explicitly.
>> >
>> > Signed-off-by: Roman Gushchin <[email protected]>
>> > Cc: Ben Segall <[email protected]>
>> > Cc: Peter Zijlstra <[email protected]>
>> > Cc: [email protected]
>> > Cc: Chris J Arges <[email protected]>
>> > Cc: Greg Kroah-Hartman <[email protected]>
>> > ---
>> >  kernel/sched/core.c  | 3 +--
>> >  kernel/sched/fair.c  | 8 ++++----
>> >  kernel/sched/sched.h | 2 +-
>> >  3 files changed, 6 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d9d8ece..e9b9c66 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> >  	/* restart the period timer (if active) to handle new period expiry */
>> >  	if (runtime_enabled && cfs_b->timer_active) {
>> >  		/* force a reprogram */
>> > -		cfs_b->timer_active = 0;
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, true);
>> >  	}
>> >  	raw_spin_unlock_irq(&cfs_b->lock);
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 7570dd9..55a0a5b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  	 */
>> >  	if (!cfs_b->timer_active) {
>> >  		__refill_cfs_bandwidth_runtime(cfs_b);
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	}
>> >
>> >  	if (cfs_b->runtime > 0) {
>> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> >  	raw_spin_lock(&cfs_b->lock);
>> >  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> >  	if (!cfs_b->timer_active)
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	raw_spin_unlock(&cfs_b->lock);
>> >  }
>> >
>> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  }
>> >
>> >  /* requires cfs_b->lock, may release to reprogram timer */
>> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>> >  {
>> >  	/*
>> >  	 * The timer may be active because we're trying to set a new bandwidth
>> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> >  		cpu_relax();
>> >  		raw_spin_lock(&cfs_b->lock);
>> >  		/* if someone else restarted the timer then we're done */
>> > -		if (cfs_b->timer_active)
>> > +		if (!force && cfs_b->timer_active)
>> >  			return;
>> >  	}
>> >
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 456e492..369b4d6 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>> >
>> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>> >
>> >  extern void free_rt_sched_group(struct task_group *tg);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

