Currently, we only pull tasks if the destination cpu group load is below the average over the domain being rebalanced. This sounds reasonable, but only as long as there are no pinned tasks; otherwise we can get an unfair task distribution. For instance, suppose the host has 16 cores and there's a container pinned to two of the cores (either strictly by using cpumask or indirectly by setting cpulimit). If we start 16 tasks in the container, the average load over the domain will be 1, so even if 15 of the tasks turn out to run on the same cpu (out of 2), the other cpu is already at the average and no tasks will be pulled, which is wrong.
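
To make the arithmetic of this example concrete, here is a minimal sketch in plain userspace C. It is not kernel code: the toy_* names and the struct are hypothetical, load is counted simply as runnable tasks per cpu, and group_power scaling and max_pull are ignored. It only contrasts a check against the domain average with a pairwise comparison against the busiest cpu, using the "/ 2" form given in the description.

```c
#include <assert.h>

/* Hypothetical, simplified model: load is the number of runnable tasks. */
struct toy_stats {
	unsigned long this_load;	/* load of the destination cpu */
	unsigned long busiest_load;	/* load of the busiest cpu */
	unsigned long domain_avg;	/* total load / cpus in the domain */
};

/* Domain average as the balancer sees it: total load over all cpus. */
static unsigned long toy_domain_avg(unsigned long total_load,
				    unsigned long nr_cpus)
{
	assert(nr_cpus > 0);
	return total_load / nr_cpus;
}

/* Average-based rule: never pull once the destination reaches the average. */
static unsigned long toy_imbalance_avg(const struct toy_stats *s)
{
	if (s->this_load >= s->busiest_load)
		return 0;
	if (s->this_load >= s->domain_avg)
		return 0;
	return (s->domain_avg - s->this_load) / 2;
}

/* Pairwise rule: move half of the gap between busiest and destination. */
static unsigned long toy_imbalance_pairwise(const struct toy_stats *s)
{
	if (s->this_load >= s->busiest_load)
		return 0;
	return (s->busiest_load - s->this_load) / 2;
}
```

With 16 tasks on a 16-cpu domain, toy_domain_avg(16, 16) is 1. For a destination cpu with load 1 and a busiest cpu with load 15, toy_imbalance_avg() returns 0 (nothing is pulled), while toy_imbalance_pairwise() returns 7.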

To overcome this issue, let's port the following patches from PCS6:

  diff-sched-balance-even-if-load-is-greater-than-average
  diff-sched-always-try-to-equalize-load-between-this-and-busiest-cpus-when-balancing

They make the balance procedure pull tasks even if the destination is above average, by setting the imbalance value to be

  (source_load - destination_load) / 2

instead of

  (average_load - destination_load) / 2

This implies decreasing the convergence speed of the balancing procedure, but PCS6 has worked like that for quite a while, so it should be fine.

Signed-off-by: Vladimir Davydov <[email protected]>
---
 kernel/sched/fair.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cedd178f963c..685517597a30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6618,7 +6618,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
 		max_pull * busiest->group_power,
-		(sds->avg_load - this->avg_load) * this->group_power
+		(busiest->avg_load - this->avg_load) * this->group_power
 	) / SCHED_POWER_SCALE;
 
 	/*
@@ -6695,13 +6695,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (this->avg_load >= busiest->avg_load)
 		goto out_balanced;
 
-	/*
-	 * Don't pull any tasks if this group is already above the domain
-	 * average load.
-	 */
-	if (this->avg_load >= sds.avg_load)
-		goto out_balanced;
-
 	if (env->idle == CPU_IDLE) {
 		/*
 		 * This cpu is idle. If the busiest group load doesn't
-- 
2.1.4

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel
