On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations, such as running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a
> 224-core system. The load balancer doesn't handle this correctly because
> env->imbalance never decreases, so it stops pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs_rq. Make sure that imbalance will be decreased by at least 1.
>
> Misfit task is the other feature that doesn't handle such a situation
> correctly, although the problem is probably harder to hit because of
> the smaller number of CPUs and running tasks on heterogeneous systems.
>
> We can't simply ensure that task_h_load() returns at least one, because
> that would imply handling underrun in other places.

Nasty one, that...
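
IIUC, the 0 comes from the integer division in task_h_load():

---
static unsigned long task_h_load(struct task_struct *p)
{
        struct cfs_rq *cfs_rq = task_cfs_rq(p);

        update_cfs_rq_h_load(cfs_rq);
        return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
                        cfs_rq_load_avg(cfs_rq) + 1);
}
---

With thousands of runnable tasks in the group, a single task's load_avg is
a tiny fraction of cfs_rq_load_avg(), so the division truncates to 0.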

Random thought: isn't that the kind of thing we have scale_load() and
scale_load_down() for? There are more uses of task_h_load() than I would
like for this, but if we upscale its output (or introduce an upscaled
variant), we could do something like:

---
detach_tasks()
{
        long imbalance = env->imbalance;

        /* Track the load imbalance at scale_load() resolution */
        if (env->migration_type == migrate_load)
                imbalance = scale_load(imbalance);

        while (!list_empty(tasks)) {
                /* ... */
                switch (env->migration_type) {
                case migrate_load:
                        load = task_h_load_upscaled(p);
                        /* ... usual bits here ... */
                        /* Sub-unit loads no longer truncate to 0 */
                        lsub_positive(&imbalance, load);
                        break;
                        /* ... */
                }

                /* Only migrate_load runs at upscaled resolution */
                if (env->migration_type == migrate_load) {
                        if (!scale_load_down(imbalance))
                                break;
                } else if (imbalance <= 0) {
                        break;
                }
        }

        /* Hand the remainder back at its usual resolution */
        env->imbalance = env->migration_type == migrate_load ?
                scale_load_down(imbalance) : imbalance;
}
---
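
For the sake of argument, the upscaled variant could just keep the result
of task_h_load()'s division at scale_load() resolution - something like
the below (task_h_load_upscaled() is entirely made up, and untested):

---
/* Hypothetical helper, mirroring task_h_load() but upscaled */
static unsigned long task_h_load_upscaled(struct task_struct *p)
{
        struct cfs_rq *cfs_rq = task_cfs_rq(p);

        update_cfs_rq_h_load(cfs_rq);
        /*
         * Upscale the numerator so that sub-unit contributions
         * survive the division instead of truncating to 0.
         */
        return div64_ul(scale_load(p->se.avg.load_avg) * cfs_rq->h_load,
                        cfs_rq_load_avg(cfs_rq) + 1);
}
---

That would keep the fix contained to detach_tasks() (and possibly
update_misfit_status()), without disturbing the other task_h_load() users.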

It's not perfect, and there's still the misfit situation to sort out -
that said, do you think this is something we could go towards?

>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
>  kernel/sched/fair.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>               return;
>       }
>
> -     rq->misfit_task_load = task_h_load(p);
> +     /*
> +      * Make sure that misfit_task_load will not be zero even if
> +      * task_h_load() returns 0. misfit_task_load is only used to select
> +      * the rq with the highest load, so adding 1 will not modify the
> +      * result of the comparison.
> +      */
> +     rq->misfit_task_load = task_h_load(p) + 1;
>  }
>
>  #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
>                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
>                               goto next;
>
> +                     /*
> +                      * Depending on the number of CPUs and tasks and the
> +                      * cgroup hierarchy, task_h_load() can return a null
> +                      * value. Make sure that env->imbalance decreases,
> +                      * otherwise detach_tasks() will stop only after
> +                      * detaching up to loop_max tasks.
> +                      */
> +                     if (!load)
> +                             load = 1;
> +
>                       env->imbalance -= load;
>                       break;
