Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-03-23 Thread Valentin Schneider
On 19/03/21 16:19, Vincent Guittot wrote:
> On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
>  wrote:
>> As stated the current behaviour is to classify groups as group_misfit_task
>> regardless of the dst_cpu's capacity. When we see a group_misfit_task
>> candidate group with a higher per-CPU capacity than the local group, we
>> don't pick it as busiest.
>>
>> I initially thought not marking those as group_misfit_task was the right
>> thing to do, as they could then be classified as group_fully_busy or
>> group_has_spare. Consider:
>>
>>   DIE [  ]
>>   MC  [][]
>>0  1  2  3
>>L  L  B  B
>>
>>   arch_scale_capacity(L) < arch_scale_capacity(B)
>>
>>   CPUs 0-1 are idle / lightly loaded
>>   CPU2 has a misfit task and a few very small tasks
>>   CPU3 has a few very small tasks
>>
>> When CPU0 is running load_balance() at DIE level, right now we'll classify
>> the [2-3] group as group_misfit_task and not pick it as busiest because the
>> local group has a lower CPU capacity.
>>
>> If we didn't do that, we could leave the misfit task alone and pull some
>> small task(s) from CPU2 or CPU3, which would be a good thing to
>
> Are you sure? The last check in update_sd_pick_busiest() should already
> filter this, so it should be enough to let the group be classified
> correctly.
>
> A group should be classified as group_misfit_task when it has a task that
> should be migrated in priority over tasks in other groups. In your case,
> you tag it as group_misfit_task but in order to do the opposite, i.e. to
> make sure it is not selected. As mentioned above, this will be filtered by
> the last check in update_sd_pick_busiest().
>

This hinges on sgc->min_capacity, which might be skewed by a CPU in the
candidate group being under severe IRQ / thermal / RT / DL pressure. That
said, you have a point in that this check and the one in
find_busiest_queue() catch most scenarios I can think of.

Let me ponder this some more, and throw it at the test infrastructure
monster if I go down that route.


Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-03-19 Thread Vincent Guittot
On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
 wrote:
>
> On 15/03/21 16:13, Vincent Guittot wrote:
> > On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> >  wrote:
> >>
> >> Consider the following (hypothetical) asymmetric CPU capacity topology,
> >> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> >>
> >>   DIE [  ]
> >>   MC  [][]
> >>0  1  2  3
> >>
> >>   | CPU | capacity_orig | capacity |
> >>   |-+---+--|
> >>   |   0 |   870 |  860 |
> >>   |   1 |   870 |  600 |
> >>   |   2 |  1024 |  850 |
> >>   |   3 |  1024 |  860 |
> >>
> >> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> >> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> >> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> >> CPU1's misfit task. This would then fall on CPU2 to pull the task.
> >>
> >> This currently won't happen, because CPU2 will fail
> >>
> >>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >
> > which has been introduced by the previous patch: patch5
> >
> >>
> >> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> >> level. In this case, the max_capacity is that of CPU0's, which is at this
> >> point in time greater than that of CPU2's. This comparison doesn't make
> >> much sense, given that the only CPUs we should care about in this scenario
> >> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> >> destination CPU).
> >>
> >> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> >> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> >> sched_group contains a misfit task to still classify it as
> >> group_misfit_task and not pick it as busiest group when pulling from a
> >
> > Could you give more details about why we should keep tracking the
> > group as misfit ? Do you have a UC in mind ?
> >
>
> As stated the current behaviour is to classify groups as group_misfit_task
> regardless of the dst_cpu's capacity. When we see a group_misfit_task
> candidate group with a higher per-CPU capacity than the local group, we
> don't pick it as busiest.
>
> I initially thought not marking those as group_misfit_task was the right
> thing to do, as they could then be classified as group_fully_busy or
> group_has_spare. Consider:
>
>   DIE [  ]
>   MC  [][]
>0  1  2  3
>L  L  B  B
>
>   arch_scale_capacity(L) < arch_scale_capacity(B)
>
>   CPUs 0-1 are idle / lightly loaded
>   CPU2 has a misfit task and a few very small tasks
>   CPU3 has a few very small tasks
>
> When CPU0 is running load_balance() at DIE level, right now we'll classify
> the [2-3] group as group_misfit_task and not pick it as busiest because the
> local group has a lower CPU capacity.
>
> If we didn't do that, we could leave the misfit task alone and pull some
> small task(s) from CPU2 or CPU3, which would be a good thing to

Are you sure? The last check in update_sd_pick_busiest() should already
filter this, so it should be enough to let the group be classified
correctly.

A group should be classified as group_misfit_task when it has a task that
should be migrated in priority over tasks in other groups. In your case,
you tag it as group_misfit_task but in order to do the opposite, i.e. to
make sure it is not selected. As mentioned above, this will be filtered by
the last check in update_sd_pick_busiest().

> do. However, by allowing a group containing a misfit task to be picked as
> the busiest group when a CPU of lower capacity is pulling, we run the risk
> of the misfit task itself being downmigrated - e.g. if we repeatedly
> increment the sd->nr_balance_failed counter and do an active balance (maybe
> because the small tasks were unfortunately cache_hot()).
>
> It's less than ideal, but I considered that not downmigrating misfit tasks
> was the thing to prioritize (and FWIW it also maintains current behaviour).
>
>
> Another approach would be to add task utilization vs CPU capacity checks in
> detach_tasks() and need_active_balance() to prevent downmigration when
> env->imbalance_type < group_misfit_task. This may go against the busiest
> group selection heuristics however (misfit tasks could be the main
> contributors to the imbalance, but we end up not moving them).
>
>
> >> lower-capacity CPU (which is the current behaviour and prevents
> >> down-migration).
> >>
> >> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> >> than the local CPU's, add a capacity check there.
> >>
> >> Reviewed-by: Qais Yousef 
> >> Signed-off-by: Valentin Schneider 
>
> >> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> continue;
> >>
> >> /* Check for a misfit task on the cpu */
> >> -   if (sd_has_asym_cpucapacity(env->sd) &&
> >> -   

Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-03-15 Thread Valentin Schneider
On 15/03/21 16:13, Vincent Guittot wrote:
> On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
>  wrote:
>>
>> Consider the following (hypothetical) asymmetric CPU capacity topology,
>> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>>
>>   DIE [  ]
>>   MC  [][]
>>0  1  2  3
>>
>>   | CPU | capacity_orig | capacity |
>>   |-+---+--|
>>   |   0 |   870 |  860 |
>>   |   1 |   870 |  600 |
>>   |   2 |  1024 |  850 |
>>   |   3 |  1024 |  860 |
>>
>> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
>> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
>> sufficiently busy, i.e. don't have enough spare capacity to accommodate
>> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>>
>> This currently won't happen, because CPU2 will fail
>>
>>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
>
> which has been introduced by the previous patch: patch5
>
>>
>> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
>> level. In this case, the max_capacity is that of CPU0's, which is at this
>> point in time greater than that of CPU2's. This comparison doesn't make
>> much sense, given that the only CPUs we should care about in this scenario
>> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
>> destination CPU).
>>
>> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
>> env->dst_cpu would grant it a capacity uplift. Separately track whether a
>> sched_group contains a misfit task to still classify it as
>> group_misfit_task and not pick it as busiest group when pulling from a
>
> Could you give more details about why we should keep tracking the
> group as misfit ? Do you have a UC in mind ?
>

As stated the current behaviour is to classify groups as group_misfit_task
regardless of the dst_cpu's capacity. When we see a group_misfit_task
candidate group with a higher per-CPU capacity than the local group, we
don't pick it as busiest.

I initially thought not marking those as group_misfit_task was the right
thing to do, as they could then be classified as group_fully_busy or
group_has_spare. Consider:

  DIE [  ]
  MC  [][]
   0  1  2  3
   L  L  B  B

  arch_scale_capacity(L) < arch_scale_capacity(B)

  CPUs 0-1 are idle / lightly loaded
  CPU2 has a misfit task and a few very small tasks
  CPU3 has a few very small tasks

When CPU0 is running load_balance() at DIE level, right now we'll classify
the [2-3] group as group_misfit_task and not pick it as busiest because the
local group has a lower CPU capacity.

If we didn't do that, we could leave the misfit task alone and pull some
small task(s) from CPU2 or CPU3, which would be a good thing to
do. However, by allowing a group containing a misfit task to be picked as
the busiest group when a CPU of lower capacity is pulling, we run the risk
of the misfit task itself being downmigrated - e.g. if we repeatedly
increment the sd->nr_balance_failed counter and do an active balance (maybe
because the small tasks were unfortunately cache_hot()).

It's less than ideal, but I considered that not downmigrating misfit tasks
was the thing to prioritize (and FWIW it also maintains current behaviour).


Another approach would be to add task utilization vs CPU capacity checks in
detach_tasks() and need_active_balance() to prevent downmigration when
env->imbalance_type < group_misfit_task. This may go against the busiest
group selection heuristics however (misfit tasks could be the main
contributors to the imbalance, but we end up not moving them).


>> lower-capacity CPU (which is the current behaviour and prevents
>> down-migration).
>>
>> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
>> than the local CPU's, add a capacity check there.
>>
>> Reviewed-by: Qais Yousef 
>> Signed-off-by: Valentin Schneider 

>> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> continue;
>>
>> /* Check for a misfit task on the cpu */
>> -   if (sd_has_asym_cpucapacity(env->sd) &&
>> -   sgs->group_misfit_task_load < rq->misfit_task_load) {
>> -   sgs->group_misfit_task_load = rq->misfit_task_load;
>> -   *sg_status |= SG_OVERLOAD;
>> +   if (!sd_has_asym_cpucapacity(env->sd) ||
>> +   !rq->misfit_task_load)
>> +   continue;
>> +
>> +   *sg_status |= SG_OVERLOAD;
>> +   sgs->group_has_misfit_task = true;
>> +
>> +   /*
>> +* Don't attempt to maximize load for misfit tasks that can't be
>> +* granted a CPU capacity uplift.
>> +*/
>> +   if (cpu_capacity_greater(env->dst_cpu, i)) {
>> +   

Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-03-15 Thread Vincent Guittot
On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
 wrote:
>
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>
>   DIE [  ]
>   MC  [][]
>0  1  2  3
>
>   | CPU | capacity_orig | capacity |
>   |-+---+--|
>   |   0 |   870 |  860 |
>   |   1 |   870 |  600 |
>   |   2 |  1024 |  850 |
>   |   3 |  1024 |  860 |
>
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>
> This currently won't happen, because CPU2 will fail
>
>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

which has been introduced by the previous patch: patch5

>
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
>
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> sched_group contains a misfit task to still classify it as
> group_misfit_task and not pick it as busiest group when pulling from a

Could you give more details about why we should keep tracking the
group as misfit ? Do you have a UC in mind ?

> lower-capacity CPU (which is the current behaviour and prevents
> down-migration).
>
> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> than the local CPU's, add a capacity check there.
>
> Reviewed-by: Qais Yousef 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 39 ++-
>  1 file changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e8a242cd1f7..41cdda7a8ea6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity;
>  }
>
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> +   return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
>  static void record_wakee(struct task_struct *p)
>  {
> /*
> @@ -8091,7 +8097,8 @@ struct sg_lb_stats {
> unsigned int group_weight;
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> -   unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> +   unsigned long group_misfit_task_load; /* Task load that can be uplifted */
> +   int   group_has_misfit_task; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> @@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
> if (sgs->group_asym_packing)
> return group_asym_packing;
>
> -   if (sgs->group_misfit_task_load)
> +   if (sgs->group_has_misfit_task)
> return group_misfit_task;
>
> if (!group_has_capacity(imbalance_pct, sgs))
> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> /* Check for a misfit task on the cpu */
> -   if (sd_has_asym_cpucapacity(env->sd) &&
> -   sgs->group_misfit_task_load < rq->misfit_task_load) {
> -   sgs->group_misfit_task_load = rq->misfit_task_load;
> -   *sg_status |= SG_OVERLOAD;
> +   if (!sd_has_asym_cpucapacity(env->sd) ||
> +   !rq->misfit_task_load)
> +   continue;
> +
> +   *sg_status |= SG_OVERLOAD;
> +   sgs->group_has_misfit_task = true;
> +
> +   /*
> +* Don't attempt to maximize load for misfit tasks that can't be
> +* granted a CPU capacity uplift.
> +*/
> +   if (cpu_capacity_greater(env->dst_cpu, i)) {
> +   sgs->group_misfit_task_load = max(
> +   sgs->group_misfit_task_load,
> +   rq->misfit_task_load);

Please encapsulate all this misfit-specific code in a dedicated function
which will be called from update_sg_lb_stats().

> }
> }
>
> @@ -8501,7 +8519,7 @@ static bool 

[PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-03-11 Thread Valentin Schneider
Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):

  DIE [  ]
  MC  [][]
   0  1  2  3

  | CPU | capacity_orig | capacity |
  |-+---+--|
  |   0 |   870 |  860 |
  |   1 |   870 |  600 |
  |   2 |  1024 |  850 |
  |   3 |  1024 |  860 |

If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
sufficiently busy, i.e. don't have enough spare capacity to accommodate
CPU1's misfit task. This would then fall on CPU2 to pull the task.

This currently won't happen, because CPU2 will fail

  capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0's, which is at this
point in time greater than that of CPU2's. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).

Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift. Separately track whether a
sched_group contains a misfit task to still classify it as
group_misfit_task and not pick it as busiest group when pulling from a
lower-capacity CPU (which is the current behaviour and prevents
down-migration).

Since find_busiest_queue() can now iterate over CPUs with a higher capacity
than the local CPU's, add a capacity check there.

Reviewed-by: Qais Yousef 
Signed-off-by: Valentin Schneider 
---
 kernel/sched/fair.c | 39 ++-
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e8a242cd1f7..41cdda7a8ea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
 }
 
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+   return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
 static void record_wakee(struct task_struct *p)
 {
/*
@@ -8091,7 +8097,8 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
-   unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+   unsigned long group_misfit_task_load; /* Task load that can be uplifted */
+   int   group_has_misfit_task; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
if (sgs->group_asym_packing)
return group_asym_packing;
 
-   if (sgs->group_misfit_task_load)
+   if (sgs->group_has_misfit_task)
return group_misfit_task;
 
if (!group_has_capacity(imbalance_pct, sgs))
@@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
 
/* Check for a misfit task on the cpu */
-   if (sd_has_asym_cpucapacity(env->sd) &&
-   sgs->group_misfit_task_load < rq->misfit_task_load) {
-   sgs->group_misfit_task_load = rq->misfit_task_load;
-   *sg_status |= SG_OVERLOAD;
+   if (!sd_has_asym_cpucapacity(env->sd) ||
+   !rq->misfit_task_load)
+   continue;
+
+   *sg_status |= SG_OVERLOAD;
+   sgs->group_has_misfit_task = true;
+
+   /*
+* Don't attempt to maximize load for misfit tasks that can't be
+* granted a CPU capacity uplift.
+*/
+   if (cpu_capacity_greater(env->dst_cpu, i)) {
+   sgs->group_misfit_task_load = max(
+   sgs->group_misfit_task_load,
+   rq->misfit_task_load);
}
}
 
@@ -8501,7 +8519,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Don't try to pull misfit tasks we can't help */
if (static_branch_unlikely(&sched_asym_cpucapacity) &&
sgs->group_type == group_misfit_task &&
-   (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+   (!sgs->group_misfit_task_load ||
 sds->local_stat.group_type != group_has_spare))
return false;
 
@@ -9448,15 +9466,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
case migrate_misfit: