Le jeudi 15 avril 2021 à 18:58:46 (+0100), Valentin Schneider a écrit : > Consider the following topology: > > DIE [ ] > MC [ ][ ] > 0 1 2 3 > > capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3}) > > w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024). > > When CPU2 goes through load_balance() (via periodic / NOHZ balance), it > should pull one CPU hog from either CPU0 or CPU1 (this is misfit task > upmigration). However, should a e.g. pcpu kworker awake on CPU0 just before > this load_balance() happens and preempt the CPU hog running there, we would > have, for the [0-1] group at CPU2's DIE level: > > o sgs->sum_nr_running > sgs->group_weight > o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct > > IOW, this group is group_overloaded. > > Considering CPU0 is picked by find_busiest_queue(), we would then visit the > preempted CPU hog in detach_tasks(). However, given it has just been > preempted by this pcpu kworker, task_hot() will prevent it from being > detached. We then leave load_balance() without having done anything. > > Long story short, preempted misfit tasks are affected by task_hot(), while > currently running misfit tasks are intentionally preempted by the stopper > task to migrate them over to a higher-capacity CPU. > > Align detach_tasks() with the active-balance logic and let it pick a > cache-hot misfit task when the destination CPU can provide a capacity > uplift. > > Signed-off-by: Valentin Schneider <valentin.schnei...@arm.com> > --- > kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 36 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d2d1a69d7aa7..43fc98d34276 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7493,6 +7493,7 @@ struct lb_env { > enum fbq_type fbq_type; > enum migration_type migration_type; > enum group_type src_grp_type; > + enum group_type dst_grp_type; > struct list_head tasks; > }; > > @@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct > lb_env *env) > return delta < (s64)sysctl_sched_migration_cost; > } > > + > +/* > + * What does migrating this task do to our capacity-aware scheduling > criterion? > + * > + * Returns 1, if the task needs more capacity than the dst CPU can provide. > + * Returns 0, if the task needs the extra capacity provided by the dst CPU > + * Returns -1, if the task isn't impacted by the migration wrt capacity. > + */ > +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env > *env) > +{ > + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) > + return -1; > + > + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) { > + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu)) > + return 0; > + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu)) > + return 1; > + else > + return -1; > + }
Being there means that task fits src_cpu capacity so why testing p against dst_cpu ? > + > + return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1; > +} I prefer the below which easier to read because the same var is use everywhere and you can remove cpu_capacity_greater. static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env) { unsigned long src_capacity, dst_capacity; if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) return -1; src_capacity = capacity_of(env->src_cpu); dst_capacity = capacity_of(env->dst_cpu); if (!task_fits_capacity(p, src_capacity)) { if (capacity_greater(dst_capacity, src_capacity)) return 0; else if (capacity_greater(src_capacity, dst_capacity)) return 1; else return -1; } return task_fits_capacity(p, dst_capacity) ? -1 : 1; } > + > #ifdef CONFIG_NUMA_BALANCING > /* > * Returns 1, if task migration degrades locality > @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct > lb_env *env) > if (tsk_cache_hot == -1) > tsk_cache_hot = task_hot(p, env); > > + /* > + * On a (sane) asymmetric CPU capacity system, the increase in compute > + * capacity should offset any potential performance hit caused by a > + * migration. > + */ > + if ((env->dst_grp_type == group_has_spare) && Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as stated in $subject > + !migrate_degrades_capacity(p, env)) > + tsk_cache_hot = 0; > + > if (tsk_cache_hot <= 0 || > env->sd->nr_balance_failed > env->sd->cache_nice_tries) { > if (tsk_cache_hot == 1) { > @@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct > lb_env *env) > if (!sds.busiest) > goto out_balanced; > > + env->dst_grp_type = local->group_type; > env->src_grp_type = busiest->group_type; > > /* Misfit tasks should be dealt with regardless of the avg load */ > -- > 2.25.1 >