Roman Gushchin <[email protected]> writes:

> At Thu, 15 May 2014 10:43:14 -0700,
> [email protected] wrote:
>>
>> Roman Gushchin <[email protected]> writes:
>>
>> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to force the
>> > period timer restart. It's not safe, because it can lead to a
>> > deadlock, described in commit 927b54fccbf0:
>> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
>> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
>> > runs for another loop iteration, the hrtimer can attempt to take
>> > rq->lock, resulting in deadlock."
>> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
>> > with the second iteration of sched_cfs_period_timer, the deadlock
>> > occurs.
>>
>> I do not see this deadlock. cfs_b->timer_active is protected by
>> cfs_b->lock on both read and write, and I don't think it would even
>> directly cause deadlock if it wasn't. In particular, this patch makes
>> us run for strictly longer outside of the interrupt (although I think
>> the patched version is also correct). The old issue was calling
>> hrtimer_cancel, which meant we held cfs_b->lock while waiting for the
>> interrupt to complete, while the interrupt was waiting to take
>> cfs_b->lock.
>
> I still think there is a deadlock. I'll try to explain.
> Three CPUs must be involved:
>
> CPU0                     CPU1                      CPU2
> take rq->lock            period timer fired
> ...                      take cfs_b lock
> ...                      ...                       tg_set_cfs_bandwidth()
> throttle_cfs_rq()        release cfs_b lock        take cfs_b lock
> ...                      distribute_cfs_runtime()  timer_active = 0
> take cfs_b->lock         wait for rq->lock         ...
> __start_cfs_bandwidth()
> {wait for timer callback
>  break if timer_active == 1}
>
> So, CPU0 and CPU1 are deadlocked.
Ahhh, yes, there is a three-cpu race here. Could you clarify in the
commit message that this is a race involving all three of
tg_set_cfs_bandwidth, the period timer, and then a second
__start_cfs_bandwidth call? The code looks good, however.

>> I just did notice however that sched_cfs_period_timer reads
>> cfs_b->period without any locks, which can in fact race with an
>> update to period (and isn't fixed by this patch). If this write
>> doesn't happen to be atomic (which is certainly not guaranteed, and
>> is entirely plausible on say 32-bit x86, much less other archs), we
>> could read a partially written value and move the timer incorrectly.
>> Pulling the lock/unlock out of do_sched_cfs_period_timer should fix
>> this easily enough.
>
> Looks like another issue.

Yeah.

> Thanks,
> Roman
>
>>
>> unthrottle_offline_cfs_rqs could cause similar issues with its
>> unlocked read of quota, and can also be easily fixed.
>>
>>
>> >
>> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
>> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
>> > restart the timer explicitly.
>> >
>> > Signed-off-by: Roman Gushchin <[email protected]>
>> > Cc: Ben Segall <[email protected]>
>> > Cc: Peter Zijlstra <[email protected]>
>> > Cc: [email protected]
>> > Cc: Chris J Arges <[email protected]>
>> > Cc: Greg Kroah-Hartman <[email protected]>
>> > ---
>> >  kernel/sched/core.c  | 3 +--
>> >  kernel/sched/fair.c  | 8 ++++----
>> >  kernel/sched/sched.h | 2 +-
>> >  3 files changed, 6 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d9d8ece..e9b9c66 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> >  	/* restart the period timer (if active) to handle new period expiry */
>> >  	if (runtime_enabled && cfs_b->timer_active) {
>> >  		/* force a reprogram */
>> > -		cfs_b->timer_active = 0;
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, true);
>> >  	}
>> >  	raw_spin_unlock_irq(&cfs_b->lock);
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 7570dd9..55a0a5b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  	 */
>> >  	if (!cfs_b->timer_active) {
>> >  		__refill_cfs_bandwidth_runtime(cfs_b);
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	}
>> >
>> >  	if (cfs_b->runtime > 0) {
>> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> >  	raw_spin_lock(&cfs_b->lock);
>> >  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> >  	if (!cfs_b->timer_active)
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	raw_spin_unlock(&cfs_b->lock);
>> >  }
>> >
>> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  }
>> >
>> >  /* requires cfs_b->lock, may release to reprogram timer */
>> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>> >  {
>> >  	/*
>> >  	 * The timer may be active because we're trying to set a new bandwidth
>> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> >  		cpu_relax();
>> >  		raw_spin_lock(&cfs_b->lock);
>> >  		/* if someone else restarted the timer then we're done */
>> > -		if (cfs_b->timer_active)
>> > +		if (!force && cfs_b->timer_active)
>> >  			return;
>> >  	}
>> >
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 456e492..369b4d6 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>> >
>> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>> >
>> >  extern void free_rt_sched_group(struct task_group *tg);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

