On Fri, 7 Sep 2018 at 14:56, Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> > > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > > It can happen that load_balance finds a busiest group and then a
> > > > busiest rq but the calculated imbalance is in fact null.
> > >
> > > Cute. Does that happen often?
> >
> > I have a use case with RT tasks that reproduces the problem regularly.
> > It happens at least when we have CPUs with different capacity, either
> > because of heterogeneous CPUs or because of RT/DL reducing the capacity
> > available for cfs. I have put the call path that triggers the problem
> > below, and according to the comment it seems that we can reach a similar
> > state when playing with priority.
> >
> > > > If the calculated imbalance is null, it's useless to try to find a
> > > > busiest rq as no task will be migrated and we can return immediately.
> > > >
> > > > This situation can happen with heterogeneous systems or SMP systems
> > > > when RT tasks are decreasing the capacity of some CPUs.
> > >
> > > Is it the result of one of those "force_balance" conditions in
> > > find_busiest_group()? Should we not fix that to then return NULL
> > > instead?
> >
> > The UC is:
> > We have a newly_idle load balance that is triggered when an RT task
> > becomes idle (but I think that I have seen this with idle load balance
> > too).
> >
> > We trigger:
> >
> > 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > 	    busiest->group_no_capacity)
> > 		goto force_balance;
> >
> > In calculate_imbalance we take the path:
> >
> > 	/*
> > 	 * Avg load of busiest sg can be less and avg load of local sg can
> > 	 * be greater than avg load across all sgs of sd because avg load
> > 	 * factors in sg capacity and sgs with smaller group_type are
> > 	 * skipped when updating the busiest sg:
> > 	 */
> > 	if (busiest->avg_load <= sds->avg_load ||
> > 	    local->avg_load >= sds->avg_load) {
> > 		env->imbalance = 0;
> > 		return fix_small_imbalance(env, sds);
> > 	}
> >
> > but fix_small_imbalance finally decides to return without modifying the
> > imbalance, like here:
> >
> > 	if (busiest->avg_load + scaled_busy_load_per_task >=
> > 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> > 		env->imbalance = busiest->load_per_task;
> > 		return;
> > 	}
>
> That one actually does modify imbalance :-) But I get your point.
>
> > Beside this patch, I'm preparing another patch for fix_small_imbalance
> > to ensure 1 task per CPU in a similar situation, but according to the
> > comment above we can reach this situation because of task priorities.
>
> Didn't we all hate fix_small_imbalance()?
Yes. The rationale behind some of the decisions in that function is more and
more opaque to me. For now I'm trying to fix all the UCs that I can find, and
then try to derive a generic rule.

As an example, I have some situations (other than those discussed here) in
fix_small_imbalance where the task migration decision mainly depends on a
variation of +/-5 in the load of a task. This ends up generating good fairness
between tasks, but the root cause of that fairness is a bit random.

> Anyway, I think I'd prefer something like the below; although it might
> be nicer to thread the return value through calculate_imbalance() and
> fix_small_imbalance(), but looking at them that's not going to be
> particularly nicer.
>
> Do you agree with this? If so, I'll stick your original Changelog on it
> and pretend this is what you sent me :-)

That's fine by me :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..0596a29f3d2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  force_balance:
>  	/* Looks like there is an imbalance. Compute it */
>  	calculate_imbalance(env, &sds);
> -	return sds.busiest;
> +	return env->imbalance ? sds.busiest : NULL;
>
>  out_balanced:
>  	env->imbalance = 0;