Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-13 Thread Xuewen Yan
Hi
> > >
> > > In fair.c:select_task_rq_fair(), if feec() returns an error (< 0), then
> > > prev_cpu is selected. I think it's better to still let feec() signal
> > > that something happened and let select_task_rq_fair() select prev_cpu by
> > > itself.
> > In fair.c:select_task_rq_fair(), when feec() returns an error (< 0),
> > new_cpu comes from find_idlest_cpu() or select_idle_sibling().
> > In your patch, you should return prev_cpu instead of -1 if you want
> > prev_cpu to be selected.
> Having a negative delta doesn't imply that prev_cpu is still available.
> E.g.: If prev_cpu cannot receive the task (and is skipped), and a
> negative delta appears when evaluating the other CPUs. In such case
> feec() cannot select prev_cpu. dst_cpu must be selected by another function.
In this case, would it be better to add a condition "prev_delta == ULONG_MAX"?
Returning -1 could avoid the negative delta, but I still think this
is not the fundamental way to solve the problem.
But I think you can send a V2 with the bail-out mechanism.
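For illustration, a minimal sketch of the bail-out check being discussed,
inside the per-pd loop of feec() (this only mirrors the idea of the
feec_bail_out branch; the actual patch may differ):

    cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
    /*
     * A util update during the search can make cur_delta smaller than
     * base_energy_pd; bail out instead of letting the subtraction wrap.
     */
    if (cur_delta < base_energy_pd)
            goto fail;
    cur_delta -= base_energy_pd;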

Regards


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-12 Thread Xuewen Yan
Hi

On Tue, Apr 13, 2021 at 1:15 AM Pierre Gondois  wrote:
>
> Hi
> > > > >
> > > > > This patch-set is not significantly improving the execution time of
> > > > > feec(). The results we have so far are an improvement of 5-10% in
> > > > > execution time, with feec() being executed in < 10us. So the
> > gain is not
> > > > > spectacular.
> > > >
> > > > well, I meant to cache all the util values and compute the energy
> > > > with those caches: when (cpu == dst_cpu), use the cached values
> > > > instead of updating the util, and do not get the util with
> > > > effective_cpu_util(), but compute it from the cache instead.
> > > > I add more parameters into pd_cache:
> > > > struct pd_cache {
> > > > unsigned long util;
> > > > unsigned long util_est;
> > > > unsigned long util_cfs;
> > > > unsigned long util_irq;
> > > > unsigned long util_rt;
> > > > unsigned long util_dl;
> > > > unsigned long bw_dl;
> > > > unsigned long freq_util;
> > > > unsigned long nrg_util;
> > > > };
> > > > In this way, util updates during feec() can be avoided. I tested with it,
> > > > and the negative delta disappeared.
> > > > Maybe this is not a good method, but it does work.
> > > If I understand correctly, you put all the fields used by
> > > core.c:effective_cpu_util() in the caches, allowing to have values not
> > > subject to updates.
> > Yes.
> > > core.c:effective_cpu_util() isn't only called from
> > > fair.c:compute_energy(). It is used in the cpufreq_schedutil.c and
> > > cpufreq_cooling.c (through core.c:sched_cpu_util()).
> > > Did you have to duplicate core.c:effective_cpu_util() to have a second
> > > version using the caches ? If yes, I think the function was meant to be
> > > unique so that all the utilization estimations go through the same path.
> > >
> > I defined a new function to distinguish it from the effective_cpu_util.
> >
> > > If your concern is to avoid negative delta, I think just bailing out
> > > when this happens should be sufficient. As shown in the last message,
> > > having a wrong placement should not happen that often, plus the prev_cpu
> > > should be used which should be ok.
> > In your patch, you didn't actually choose prev_cpu; you return -1.
> >
> > > If you want to cache the values, I think a stronger justification will
> > > be asked: this seems to be a big modification compared to the initial
> > > issue, knowing that another simpler solution is available (i.e. bailing
> > > out). I was not able to prove there was a significant gain in the
> > > find_energy_efficient_cpu() execution time, but I would be happy if you
> > > can, or if you find other arguments.
> > Yes, you are right, perhaps there is indeed no need for such a big
> > modification.
> >
> > Regards
>
> In fair.c:select_task_rq_fair(), if feec() returns an error (< 0), then
> prev_cpu is selected. I think it's better to still let feec() signal
> that something happened and let select_task_rq_fair() select prev_cpu by
> itself.
In fair.c:select_task_rq_fair(), when feec() returns an error (< 0),
new_cpu comes from find_idlest_cpu() or select_idle_sibling().
In your patch, you should return prev_cpu instead of -1 if you want
prev_cpu to be selected.
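For context, the EAS hook in fair.c:select_task_rq_fair() looks roughly like
this (a simplified sketch, not an exact quote of the source):

    if (sched_energy_enabled()) {
            new_cpu = find_energy_efficient_cpu(p, prev_cpu);
            if (new_cpu >= 0)
                    return new_cpu;
            /*
             * feec() returned < 0: fall through to the regular path,
             * which ends up in find_idlest_cpu() or select_idle_sibling()
             * rather than simply picking prev_cpu.
             */
            new_cpu = prev_cpu;
    }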

> Are you planning to submit a V2 with the bail out mechanism ?
Maybe it would be better if you submitted the V2? I just reviewed the
patch and raised some questions.
>
Regards


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-12 Thread Xuewen Yan
Hi
>
> Hi,
> > >
> > > This patch-set is not significantly improving the execution time of
> > > feec(). The results we have so far are an improvement of 5-10% in
> > > execution time, with feec() being executed in < 10us. So the gain is not
> > > spectacular.
> >
> > well, I meant to cache all the util values and compute the energy with
> > those caches: when (cpu == dst_cpu), use the cached values instead of
> > updating the util, and do not get the util with effective_cpu_util(),
> > but compute it from the cache instead.
> > I add more parameters into pd_cache:
> > struct pd_cache {
> > unsigned long util;
> > unsigned long util_est;
> > unsigned long util_cfs;
> > unsigned long util_irq;
> > unsigned long util_rt;
> > unsigned long util_dl;
> > unsigned long bw_dl;
> > unsigned long freq_util;
> > unsigned long nrg_util;
> > };
> > In this way, util updates during feec() can be avoided. I tested with it,
> > and the negative delta disappeared.
> > Maybe this is not a good method, but it does work.
> If I understand correctly, you put all the fields used by
> core.c:effective_cpu_util() in the caches, allowing to have values not
> subject to updates.
Yes.
> core.c:effective_cpu_util() isn't only called from
> fair.c:compute_energy(). It is used in the cpufreq_schedutil.c and
> cpufreq_cooling.c (through core.c:sched_cpu_util()).
> Did you have to duplicate core.c:effective_cpu_util() to have a second
> version using the caches ? If yes, I think the function was meant to be
> unique so that all the utilization estimations go through the same path.
>
I defined a new function to distinguish it from the effective_cpu_util.

> If your concern is to avoid negative delta, I think just bailing out
> when this happens should be sufficient. As shown in the last message,
> having a wrong placement should not happen that often, plus the prev_cpu
> should be used which should be ok.
In your patch, you didn't actually choose prev_cpu; you return -1.

> If you want to cache the values, I think a stronger justification will
> be asked: this seems to be a big modification compared to the initial
> issue, knowing that another simpler solution is available (i.e. bailing
> out). I was not able to prove there was a significant gain in the
> find_energy_efficient_cpu() execution time, but I would be happy if you
> can, or if you find other arguments.
Yes, you are right, perhaps there is indeed no need for such a big modification.

Regards


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-08 Thread Xuewen Yan
Hi
>
> Hi,
> > Hi
> >
> > On Wed, Apr 7, 2021 at 10:11 PM Pierre  wrote:
> > >
> > > Hi,
> > > > I tested the patch, but the overflow still exists.
> > > > In "sched/fair: Use pd_cache to speed up find_energy_efficient_cpu()",
> > > > I wonder why the cpu util is recomputed when cpu==dst_cpu in
> > > > compute_energy(); when the dst_cpu's util changes, it would also
> > > > cause the overflow.
> > >
> > > The patches aim to cache the energy values for the CPUs whose
> > > utilization is not modified (so we don't have to compute it multiple
> > > times). The values cached are the 'base values' of the CPUs, i.e. when
> > > the task is not placed on the CPU. When (cpu==dst_cpu) in
> > > compute_energy(), it means the energy values need to be updated instead
> > > of using the cached ones.
> > >
> > well, would it be better to use task_util(p) + the cached values? But in
> > that case, the cache may need more parameters.
>
> This patch-set is not significantly improving the execution time of
> feec(). The results we have so far are an improvement of 5-10% in
> execution time, with feec() being executed in < 10us. So the gain is not
> spectacular.

well, I meant to cache all the util values and compute the energy with those
caches: when (cpu == dst_cpu), use the cached values instead of updating the
util, and do not get the util with effective_cpu_util(), but compute it from
the cache instead.
I add more parameters into pd_cache:
struct pd_cache {
	unsigned long util;
	unsigned long util_est;
	unsigned long util_cfs;
	unsigned long util_irq;
	unsigned long util_rt;
	unsigned long util_dl;
	unsigned long bw_dl;
	unsigned long freq_util;
	unsigned long nrg_util;
};
In this way, util updates during feec() can be avoided. I tested with it,
and the negative delta disappeared.
Maybe this is not a good method, but it does work.
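As a rough illustration of what I mean (a sketch only; the helper name
pd_cache_cpu_util() and the exact use of the fields are made up here, not
taken from the posted patches):

    /*
     * Sketch: return a CPU's utilization from the snapshot taken once
     * before the pd loop, instead of re-reading the rq through
     * effective_cpu_util().
     */
    static unsigned long pd_cache_cpu_util(struct pd_cache *pc,
                                           struct task_struct *p,
                                           int cpu, int dst_cpu)
    {
            unsigned long util = pc->freq_util;

            /* Only the destination CPU sees @p's extra utilization. */
            if (cpu == dst_cpu)
                    util += task_util_est(p);

            return min(util, arch_scale_cpu_capacity(cpu));
    }

Because every CPU other than dst_cpu keeps its cached value, a concurrent
util update cannot change the comparison halfway through feec().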
>
> >
> > > You are right, there is still a possibility to have a negative delta
> > > with the patches at:
> > >
> > https://gitlab.arm.com/linux-arm/linux-power/-/commits/eas/next/integration-20210129
> > > Adding a check before subtracting the values, and bailing out in such
> > > case would avoid this, such as at:
> > > https://gitlab.arm.com/linux-arm/linux-pg/-/commits/feec_bail_out/
> > >
> > In your patch, you bail out of the case with "goto fail", which means EAS
> > is not used in that case. However, in real scenarios, this case often
> > occurs when selecting a cpu for a small task.
> > As a result, the small task would not select a cpu according to EAS;
> > might that affect power consumption?
> With this patch (bailing out), the percentages of feec() returning due to
> a negative delta that I get are:
> on a Juno-r2, with 2 big CPUs and 4 little CPUs (capacity of 383), with a
> workload running for 5s with tasks having a period of 16 ms and:
>   - 50 tasks at 1%:   0.14%
>   - 30 tasks at 1%:   0.54%
>   - 10 tasks at 1%: < 0.1%
>   - 30 tasks at 5%: < 0.1%
>   - 10 tasks at 5%: < 0.1%
> It doesn't happen so often to me. If we bail out of feec(), the task will
> still have another opportunity in the next call. However I agree this
> can lead to a bad placement when this happens.
> >
> > > I think a similar modification should be done in your patch. Even though
> > > this is a good idea to group the calls to compute_energy() to reduce the
> > > chances of having updates of utilization values in between the
> > > compute_energy() calls,
> > > there is still a chance to have updates. I think it happened when I
> > > applied your patch.
> > >
> > > About changing the delta(s) from 'unsigned long' to 'long', I am not
> > > sure of the meaning of having a negative delta. I think it would be
> > > better to check and fail before it happens instead.
> > >
> > > Regards
> > >
>
>
>


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-07 Thread Xuewen Yan
Hi

On Wed, Apr 7, 2021 at 10:11 PM Pierre  wrote:
>
> Hi,
> > I tested the patch, but the overflow still exists.
> > In "sched/fair: Use pd_cache to speed up find_energy_efficient_cpu()",
> > I wonder why the cpu util is recomputed when cpu==dst_cpu in compute_energy();
> > when the dst_cpu's util changes, it would also cause the overflow.
>
> The patches aim to cache the energy values for the CPUs whose
> utilization is not modified (so we don't have to compute it multiple
> times). The values cached are the 'base values' of the CPUs, i.e. when
> the task is not placed on the CPU. When (cpu==dst_cpu) in
> compute_energy(), it means the energy values need to be updated instead
> of using the cached ones.
>
well, would it be better to use task_util(p) + the cached values? But in
that case, the cache may need more parameters.

> You are right, there is still a possibility to have a negative delta
> with the patches at:
> https://gitlab.arm.com/linux-arm/linux-power/-/commits/eas/next/integration-20210129
> Adding a check before subtracting the values, and bailing out in such
> case would avoid this, such as at:
> https://gitlab.arm.com/linux-arm/linux-pg/-/commits/feec_bail_out/
>
In your patch, you bail out of the case with "goto fail", which means EAS is
not used in that case. However, in real scenarios, this case often occurs
when selecting a cpu for a small task.
As a result, the small task would not select a cpu according to EAS; might
that affect power consumption?

> I think a similar modification should be done in your patch. Even though
> this is a good idea to group the calls to compute_energy() to reduce the
> chances of having updates of utilization values in between the
> compute_energy() calls,
> there is still a chance to have updates. I think it happened when I
> applied your patch.
>
> About changing the delta(s) from 'unsigned long' to 'long', I am not
> sure of the meaning of having a negative delta. I think it would be
> better to check and fail before it happens instead.
>
> Regards
>


Re: [PATCH] sched/fair: use signed long when compute energy delta in eas

2021-04-06 Thread Xuewen Yan
Hi Dietmar

On Fri, Apr 2, 2021 at 2:07 AM Dietmar Eggemann
 wrote:
>
> +cc: Pierre Gondois 
>
> On 30/03/2021 11:45, Quentin Perret wrote:
> > Hi,
> >
> > On Tuesday 30 Mar 2021 at 13:21:54 (+0800), Xuewen Yan wrote:
> >> From: Xuewen Yan 
> >>
> >> Currently the energy delta is computed as follows:
> >>
> >> base_energy_pd = compute_energy(p, -1, pd);
> >>         ---> Traverse all CPUs in pd
> >>         ---> em_pd_energy()
> >> ----------------------------------------------------- \
> >> search for the max_spare_cap_cpu                        \
> >> -----------------------------------------------------    > search time
> >> cur_delta = compute_energy(p, max_spare_cap_cpu, pd);   /
> >>         ---> Traverse all CPUs in pd                   /
> >>         ---> em_pd_energy()                           /
> >> cur_delta -= base_energy_pd;
> >>
> >> During the search time, or while calculating cpu_util in
> >> compute_energy(), a task dequeue or a cpu_util change may occur.
> >> This can make cur_energy < base_energy_pd, so cur_delta would be
> >> negative. But since cur_delta is an unsigned long, it then always
> >> compares as bigger than the best_delta of the previous pd.
> >>
> >> Change the variables to signed long.
> >
> > Is that really helping though?
> >
> > Yes you will not overflow, but the decision is still 'wrong' if the util
> > values are not stable for the entire wake-up. I think folks on the Arm
> > side had patches to try and cache the util values upfront, and then use
> > them throughout feec() and compute_energy(), which I think would be a
> > better fix.
> >
> > Dietmar, wdyt?
>
> Yes, we have some patches from Pierre Gondois which introduce a pd cache
> to store the CPU utilization values so they can be reused for 'cpu !=
> dst_cpu' calculations within find_energy_efficient_cpu() (feec()).
>
> We did run them in our Jan 2021 EAS integration:
>
> https://gitlab.arm.com/linux-arm/linux-power/-/commits/eas/next/integration-20210129
>
>   sched: Allocate pd_cache when EAS is enabled
>   sched/fair: Use pd_cache to speed up find_energy_efficient_cpu()
>
> We haven't posted them since we're still looking for a story to justify
> the extra complexity. The experiments on Arm64 Juno (2 big, 4 little
> CPUs) showed 1-2% failure due to changes of CPU utilization values
> during feec(). There was a 5% (big CPU)-10% (little CPU) runtime
> reduction for feec() with the patches.

I tested the patch, but the overflow still exists.
In "sched/fair: Use pd_cache to speed up find_energy_efficient_cpu()",
I wonder why the cpu util is recomputed when cpu==dst_cpu in compute_energy();
when the dst_cpu's util changes, it would also cause the overflow.

Thanks


[PATCH] sched/fair: use signed long when compute energy delta in eas

2021-03-29 Thread Xuewen Yan
From: Xuewen Yan 

Currently the energy delta is computed as follows:

base_energy_pd = compute_energy(p, -1, pd);
        ---> Traverse all CPUs in pd
        ---> em_pd_energy()
----------------------------------------------------- \
search for the max_spare_cap_cpu                        \
-----------------------------------------------------    > search time
cur_delta = compute_energy(p, max_spare_cap_cpu, pd);   /
        ---> Traverse all CPUs in pd                   /
        ---> em_pd_energy()                           /
cur_delta -= base_energy_pd;

During the search time, or while calculating cpu_util in
compute_energy(), a task dequeue or a cpu_util change may occur.
This can make cur_energy < base_energy_pd, so cur_delta would be
negative. But since cur_delta is an unsigned long, it then always
compares as bigger than the best_delta of the previous pd.

Change the variables to signed long.

Signed-off-by: Xuewen Yan 
---
 kernel/sched/fair.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..310d9d215cd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6586,7 +6586,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct 
perf_domain *pd)
  */
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
-   unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
+   long prev_delta = LONG_MAX, best_delta = LONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
unsigned long cpu_cap, util, base_energy = 0;
int cpu, best_energy_cpu = prev_cpu;
@@ -6613,13 +6613,11 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
goto unlock;
 
for (; pd; pd = pd->next) {
-   unsigned long cur_delta, spare_cap, max_spare_cap = 0;
+   long cur_delta;
+   unsigned long spare_cap, max_spare_cap = 0;
unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
-
-   /* Compute the 'base' energy of the pd, without @p */
-   base_energy_pd = compute_energy(p, -1, pd);
-   base_energy += base_energy_pd;
+   bool prev_in_pd = false;
 
for_each_cpu_and(cpu, perf_domain_span(pd), 
sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
@@ -6641,13 +6639,8 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (!fits_capacity(util, cpu_cap))
continue;
 
-   /* Always use prev_cpu as a candidate. */
-   if (cpu == prev_cpu) {
-   prev_delta = compute_energy(p, prev_cpu, pd);
-   prev_delta -= base_energy_pd;
-   best_delta = min(best_delta, prev_delta);
-   }
-
+   if (cpu == prev_cpu)
+   prev_in_pd = true;
/*
 * Find the CPU with the maximum spare capacity in
 * the performance domain
@@ -6658,6 +6651,16 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
}
}
 
+   /* Compute the 'base' energy of the pd, without @p */
+   base_energy_pd = compute_energy(p, -1, pd);
+   base_energy += base_energy_pd;
+
+   /* Always use prev_cpu as a candidate. */
+   if (prev_in_pd) {
+   prev_delta = compute_energy(p, prev_cpu, pd);
+   prev_delta -= base_energy_pd;
+   best_delta = min(best_delta, prev_delta);
+   }
/* Evaluate the energy impact of using this CPU. */
if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
@@ -6675,7 +6678,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
 * least 6% of the energy used by prev_cpu.
 */
-   if (prev_delta == ULONG_MAX)
+   if (prev_delta == LONG_MAX)
return best_energy_cpu;
 
if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
-- 
2.29.0
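To see the underflow in isolation, a tiny stand-alone user-space demo (not
kernel code; the numbers are made up):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long base_energy_pd = 1000;    /* pd energy without the task */
            unsigned long cur_energy = 990;         /* util dropped during the search */
            unsigned long best_delta = 50;          /* best delta of a previous pd */
            unsigned long cur_delta = cur_energy - base_energy_pd;  /* wraps around */

            printf("cur_delta = %lu (ULONG_MAX = %lu)\n", cur_delta, ULONG_MAX);
            /* The wrapped value can never win the comparison against best_delta. */
            printf("cur_delta < best_delta ? %d\n", cur_delta < best_delta);
            return 0;
    }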



[PATCH] cpufreq: Judging new_policy before update related_cpus

2021-02-02 Thread Xuewen Yan
From: Xuewen Yan 

When all of the policy->related_cpus are offline and one cpu is then
brought up, if the driver's ->online callback is NULL, the code updates
->related_cpus with ->cpus. At that point ->cpus contains only the one
online cpu, so ->related_cpus ends up different from the original
->related_cpus.

Signed-off-by: Xuewen Yan 
---
 drivers/cpufreq/cpufreq.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index d0a3525ce27f..3d512ac463a5 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1374,8 +1374,10 @@ static int cpufreq_online(unsigned int cpu)
if (ret)
goto out_exit_policy;
 
-   /* related_cpus should at least include policy->cpus. */
-   cpumask_copy(policy->related_cpus, policy->cpus);
+   if (new_policy) {
+   /* related_cpus should at least include policy->cpus. */
+   cpumask_copy(policy->related_cpus, policy->cpus);
+   }
}
 
down_write(&policy->rwsem);
-- 
2.29.0



[tip: sched/core] sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue

2021-01-14 Thread tip-bot2 for Xuewen Yan
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8c1f560c1ea3f19e22ba356f62680d9d449c9ec2
Gitweb:
https://git.kernel.org/tip/8c1f560c1ea3f19e22ba356f62680d9d449c9ec2
Author:Xuewen Yan 
AuthorDate:Fri, 18 Dec 2020 17:27:52 +08:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 14 Jan 2021 11:20:10 +01:00

sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue

CPU (root cfs_rq) estimated utilization (util_est) is currently used in
dequeue_task_fair() to drive frequency selection before it is updated.

with:

CPU_util: rq->cfs.avg.util_avg
CPU_util_est: rq->cfs.avg.util_est
CPU_utilization : max(CPU_util, CPU_util_est)
task_util   : p->se.avg.util_avg
task_util_est   : p->se.avg.util_est

dequeue_task_fair():

/* (1) CPU_util and task_util update + inform schedutil about
   CPU_utilization changes */
for_each_sched_entity() /* 2 loops */
(dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
 -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
 -> sugov_get_util() -> cpu_util_cfs()

/* (2) CPU_util_est and task_util_est update */
util_est_dequeue()

cpu_util_cfs() uses CPU_utilization which could lead to a false (too
high) utilization value for schedutil in task ramp-down or ramp-up
scenarios during task dequeue.

To mitigate the issue split the util_est update (2) into:

 (A) CPU_util_est update in util_est_dequeue()
 (B) task_util_est update in util_est_update()

Place (A) before (1) and keep (B) where (2) is. The latter is necessary
since (B) relies on task_util update in (1).

Fixes: 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")
Signed-off-by: Xuewen Yan 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/1608283672-18240-1-git-send-email-xuewen.ya...@gmail.com
---
 kernel/sched/fair.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 389cb58..40d3ebf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3943,6 +3943,22 @@ static inline void util_est_enqueue(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_cfs_tp(cfs_rq);
 }
 
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+   struct task_struct *p)
+{
+   unsigned int enqueued;
+
+   if (!sched_feat(UTIL_EST))
+   return;
+
+   /* Update root cfs_rq's estimated utilization */
+   enqueued  = cfs_rq->avg.util_est.enqueued;
+   enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
+   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+
+   trace_sched_util_est_cfs_tp(cfs_rq);
+}
+
 /*
  * Check if a (signed) value is within a specified (unsigned) margin,
  * based on the observation that:
@@ -3956,23 +3972,16 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
 }
 
-static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+static inline void util_est_update(struct cfs_rq *cfs_rq,
+  struct task_struct *p,
+  bool task_sleep)
 {
long last_ewma_diff;
struct util_est ue;
-   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
 
-   /* Update root cfs_rq's estimated utilization */
-   ue.enqueued  = cfs_rq->avg.util_est.enqueued;
-   ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
-   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
-
-   trace_sched_util_est_cfs_tp(cfs_rq);
-
/*
 * Skip update of task's estimated utilization when the task has not
 * yet completed an activation, e.g. being migrated.
@@ -4012,8 +4021,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct 
task_struct *p, bool task_sleep)
 * To avoid overestimation of actual task utilization, skip updates if
 * we cannot grant there is idle time in this CPU.
 */
-   cpu = cpu_of(rq_of(cfs_rq));
-   if (task_util(p) > capacity_orig_of(cpu))
+   if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
 
/*
@@ -4096,8 +4104,11 @@ static inline void
 util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 
 static inline void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
-bool task_sleep) {}
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
+   bool task_sleep) {}
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
#endif /* CONFIG_SMP */

[PATCH v4] sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue

2020-12-18 Thread Xuewen Yan
From: Xuewen Yan 

CPU (root cfs_rq) estimated utilization (util_est) is currently used in
dequeue_task_fair() to drive frequency selection before it is updated.

with:

CPU_util: rq->cfs.avg.util_avg
CPU_util_est: rq->cfs.avg.util_est
CPU_utilization : max(CPU_util, CPU_util_est)
task_util   : p->se.avg.util_avg
task_util_est   : p->se.avg.util_est

dequeue_task_fair():

/* (1) CPU_util and task_util update + inform schedutil about
   CPU_utilization changes */
for_each_sched_entity() /* 2 loops */
(dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
 -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
 -> sugov_get_util() -> cpu_util_cfs()

/* (2) CPU_util_est and task_util_est update */
util_est_dequeue()

cpu_util_cfs() uses CPU_utilization which could lead to a false (too
high) utilization value for schedutil in task ramp-down or ramp-up
scenarios during task dequeue.

To mitigate the issue split the util_est update (2) into:

 (A) CPU_util_est update in util_est_dequeue()
 (B) task_util_est update in util_est_update()

Place (A) before (1) and keep (B) where (2) is. The latter is necessary
since (B) relies on task_util update in (1).

Fixes: 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")

Signed-off-by: Xuewen Yan 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
---
Change since v3:
-add reviewer
-add more comment details

Changes since v2:
-modify the comment
-move util_est_dequeue above within_margin()
-modify the tab and space

Changes since v1:
-change the util_est_dequeue/update to inline type
-use unsigned int enqueued rather than util_est in util_est_dequeue
-remove "cpu" var

---
 kernel/sched/fair.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..f3a1b7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3932,6 +3932,22 @@ static inline void util_est_enqueue(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_cfs_tp(cfs_rq);
 }
 
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+   struct task_struct *p)
+{
+   unsigned int enqueued;
+
+   if (!sched_feat(UTIL_EST))
+   return;
+
+   /* Update root cfs_rq's estimated utilization */
+   enqueued  = cfs_rq->avg.util_est.enqueued;
+   enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
+   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+
+   trace_sched_util_est_cfs_tp(cfs_rq);
+}
+
 /*
  * Check if a (signed) value is within a specified (unsigned) margin,
  * based on the observation that:
@@ -3945,23 +3961,16 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
 }
 
-static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+static inline void util_est_update(struct cfs_rq *cfs_rq,
+  struct task_struct *p,
+  bool task_sleep)
 {
long last_ewma_diff;
struct util_est ue;
-   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
 
-   /* Update root cfs_rq's estimated utilization */
-   ue.enqueued  = cfs_rq->avg.util_est.enqueued;
-   ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
-   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
-
-   trace_sched_util_est_cfs_tp(cfs_rq);
-
/*
 * Skip update of task's estimated utilization when the task has not
 * yet completed an activation, e.g. being migrated.
@@ -4001,8 +4010,7 @@ static inline bool within_margin(int value, int margin)
 * To avoid overestimation of actual task utilization, skip updates if
 * we cannot grant there is idle time in this CPU.
 */
-   cpu = cpu_of(rq_of(cfs_rq));
-   if (task_util(p) > capacity_orig_of(cpu))
+   if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
 
/*
@@ -4085,8 +4093,11 @@ static inline int newidle_balance(struct rq *rq, struct 
rq_flags *rf)
 util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 
 static inline void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
-bool task_sleep) {}
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
+   bool task_sleep) {}
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
 #endif /* CONFIG_SMP */
@@ -5589,6 +5600,8 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);

[PATCH v3] sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue

2020-12-18 Thread Xuewen Yan
From: Xuewen Yan 

CPU (root cfs_rq) estimated utilization (util_est) is currently used in
dequeue_task_fair() to drive frequency selection before it is updated.

with:

CPU_util: rq->cfs.avg.util_avg
CPU_util_est: rq->cfs.avg.util_est
CPU_utilization : max(CPU_util, CPU_util_est)
task_util   : p->se.avg.util_avg
task_util_est   : p->se.avg.util_est

dequeue_task_fair():

/* (1) CPU_util and task_util update + inform schedutil about
   CPU_utilization changes */
for_each_sched_entity() /* 2 loops */
(dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
 -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
 -> sugov_get_util() -> cpu_util_cfs()

/* (2) CPU_util_est and task_util_est update */
util_est_dequeue()

cpu_util_cfs() uses CPU_utilization which could lead to a false (too
high) utilization value for schedutil in task ramp-down or ramp-up
scenarios during task dequeue.

To mitigate the issue split the util_est update (2) into:

 (A) CPU_util_est update in util_est_dequeue()
 (B) task_util_est update in util_est_update()

Place (A) before (1) and keep (B) where (2) is. The latter is necessary
since (B) relies on task_util update in (1).

Signed-off-by: Xuewen Yan 
Reviewed-by: Dietmar Eggemann 
---
Changes since v2:
-modify the comment
-move util_est_dequeue above within_margin()
-modify the tab and space

Changes since v1:
-change the util_est_dequeue/update to inline type
-use unsigned int enqueued rather than util_est in util_est_dequeue
-remove "cpu" var

---
 kernel/sched/fair.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..f3a1b7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3932,6 +3932,22 @@ static inline void util_est_enqueue(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_cfs_tp(cfs_rq);
 }
 
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+   struct task_struct *p)
+{
+   unsigned int enqueued;
+
+   if (!sched_feat(UTIL_EST))
+   return;
+
+   /* Update root cfs_rq's estimated utilization */
+   enqueued  = cfs_rq->avg.util_est.enqueued;
+   enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
+   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+
+   trace_sched_util_est_cfs_tp(cfs_rq);
+}
+
 /*
  * Check if a (signed) value is within a specified (unsigned) margin,
  * based on the observation that:
@@ -3945,23 +3961,16 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
 }
 
-static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+static inline void util_est_update(struct cfs_rq *cfs_rq,
+  struct task_struct *p,
+  bool task_sleep)
 {
long last_ewma_diff;
struct util_est ue;
-   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
 
-   /* Update root cfs_rq's estimated utilization */
-   ue.enqueued  = cfs_rq->avg.util_est.enqueued;
-   ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
-   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
-
-   trace_sched_util_est_cfs_tp(cfs_rq);
-
/*
 * Skip update of task's estimated utilization when the task has not
 * yet completed an activation, e.g. being migrated.
@@ -4001,8 +4010,7 @@ static inline bool within_margin(int value, int margin)
 * To avoid overestimation of actual task utilization, skip updates if
 * we cannot grant there is idle time in this CPU.
 */
-   cpu = cpu_of(rq_of(cfs_rq));
-   if (task_util(p) > capacity_orig_of(cpu))
+   if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
 
/*
@@ -4085,8 +4093,11 @@ static inline int newidle_balance(struct rq *rq, struct 
rq_flags *rf)
 util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 
 static inline void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
-bool task_sleep) {}
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
+   bool task_sleep) {}
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
 #endif /* CONFIG_SMP */
@@ -5589,6 +5600,8 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
 
+   util_est_dequeue(&rq->cfs, p);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
  

[PATCH v2] fair/util_est: fix schedutil may use an old util_est.enqueued value at dequeue

2020-12-15 Thread Xuewen Yan
From: Xuewen Yan 

When a task is dequeued, its util and the cfs_rq's util are updated; the
following call chain exists here:

update_load_avg() -> cfs_rq_util_change() -> cpufreq_update_util() ->
sugov_update_[shared\|single] -> sugov_get_util() -> cpu_util_cfs();

util = max { rq->cfs.avg.util_avg, rq->cfs.avg.util_est.enqueued };

For tasks that alternate long and short runs, the
rq->cfs.avg.util_est.enqueued may be bigger than rq->cfs.avg.util_avg,
but because _task_util_est(p) has not been subtracted yet, schedutil
ends up using a stale util_est.enqueued value.

This has an effect in task ramp-down and ramp-up scenarios: when the
task is dequeued, we want the frequency to be reduced as soon as
possible. If schedutil still sees the old util_est.enqueued, the cpu
may not be able to reduce its frequency.

Separate util_est_dequeue() into util_est_dequeue() and
util_est_update(), and subtract _task_util_est(p) before the util is
updated.

Signed-off-by: Xuewen Yan 
Signed-off-by: Dietmar Eggemann 

---
Changes since v1:
-change the util_est_dequeue/update to inline type
-use unsigned int enqueued rather than util_est in util_est_dequeue
-remove "cpu" var

---
 kernel/sched/fair.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..864d6b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3945,22 +3945,31 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
 }
 
-static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+   struct task_struct *p)
 {
-   long last_ewma_diff;
-   struct util_est ue;
-   int cpu;
+   unsigned int enqueued;
 
if (!sched_feat(UTIL_EST))
return;
 
/* Update root cfs_rq's estimated utilization */
-   ue.enqueued  = cfs_rq->avg.util_est.enqueued;
-   ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
-   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+   enqueued  = cfs_rq->avg.util_est.enqueued;
+   enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
+   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
 
trace_sched_util_est_cfs_tp(cfs_rq);
+}
+
+static inline void util_est_update(struct cfs_rq *cfs_rq,
+   struct task_struct *p,
+   bool task_sleep)
+{
+   long last_ewma_diff;
+   struct util_est ue;
+
+   if (!sched_feat(UTIL_EST))
+   return;
 
/*
 * Skip update of task's estimated utilization when the task has not
@@ -4001,8 +4010,7 @@ static inline bool within_margin(int value, int margin)
 * To avoid overestimation of actual task utilization, skip updates if
 * we cannot grant there is idle time in this CPU.
 */
-   cpu = cpu_of(rq_of(cfs_rq));
-   if (task_util(p) > capacity_orig_of(cpu))
+   if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
 
/*
@@ -4085,7 +4093,10 @@ static inline int newidle_balance(struct rq *rq, struct 
rq_flags *rf)
 util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 
 static inline void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
 bool task_sleep) {}
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
@@ -5589,6 +5600,8 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
 
+   util_est_dequeue(&rq->cfs, p);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@@ -5639,7 +5652,7 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
rq->next_balance = jiffies;
 
 dequeue_throttle:
-   util_est_dequeue(&rq->cfs, p, task_sleep);
+   util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
 }
 
-- 
1.9.1



Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-15 Thread Xuewen Yan
On Tue, Dec 15, 2020 at 5:39 PM Vincent Guittot
 wrote:
>
> On Mon, 14 Dec 2020 at 19:46, Dietmar Eggemann  
> wrote:
> >
> > On 11/12/2020 13:03, Ryan Y wrote:
> > > Hi Dietmar,
> > >
> > > Yes! That's exactly what I meant.
> > >
> > >> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
> > >> cpu_util_cfs() operates on an old  cfs_rq->avg.util_est.enqueued value?
> > >
> > > well, because of this, when p is dequeued, _task_util_est(p) should be
> > > subtracted before cfs_rq_util_change().
> > > However, the original util_est_dequeue() dequeues the util_est and
> > > updates p->se.avg.util_est together.
> > > So I separated the original util_est_dequeue() to deal with the issue.
> >
> > OK, I see.
> >
> > I ran a testcase '50% periodic task 'task0-0' (8ms/16ms)' with
> > PELT + proprietary trace events within dequeue_task_fair() call:
> >
> > task0-0-1710 [002] 218.215535: sched_pelt_se:  cpu=2 path=(null) 
> > comm=task0-0 pid=1710 load=596 runnable=597 util=597 
> > update_time=218123022336
> > task0-0-1710 [002] 218.215536: sched_pelt_cfs: cpu=2 path=/ load=597 
> > runnable=597 util=597 update_time=218123022336
> > task0-0-1710 [002] 218.215538: bprint: sugov_get_util: CPU2 
> > rq->cfs.avg.util_avg=597 rq->cfs.avg.util_est.enqueued=601
> > task0-0-1710 [002] 218.215540: sched_util_est_cfs: cpu=2 path=/ enqueued=0 
> > ewma=0 util=597
> > task0-0-1710 [002] 218.215542: bprint: dequeue_task_fair: CPU2 
> > [task0-0 1710] rq->cfs.avg.util_avg=[576->597] 
> > rq->cfs.avg.util_est.enqueued=[601->0]
> >
> > It's true that 'sugov_get_util() -> cpu_util_cfs()' can use
> > rq->cfs.avg.util_est.enqueued before _task_util_est(p) is subtracted
> > from it.
> >
> > But isn't rq->cfs.avg.util_est.enqueued (in this case 601) always close
> > to rq->cfs.avg.util_avg (597) since the task was just running?
> > The cfs_rq utilization contains a blocked (sleeping) task.
>
> There will be a difference if the task alternates long and short runs
> in which case util_avg is lower than util_est. But even in this case,
> the freq will be update at next enqueue/dequeue/tick.
> The only real case could be when cpu goes idle in shallow state (WFI)
> which is impacted by the freq/voltage. But even in this case, the
> situation should not be long
>
> That being said, I agree that the value used by schedutil is not
> correct at dequeue
>

Hi Dietmar,

as Vincent said, like the following scenario:

     running        sleep       running  sleep
   |-----------|..............|--------|........|
                                       ^^

at the ^^ time, the util_avg is lower than util_est.
We hope that the CPU frequency would be reduced as soon as possible after
the task is dequeued, but this issue hurts how quickly the frequency can
drop.
Worse, from that point on, if there is no task enqueue or any other event
that updates the load, the cpu may keep running at a high frequency.

If util_est_dequeue() is kept as it is, are there other concerns, or would
this patch affect other places in the system?
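For reference, the value schedutil ends up using comes from cpu_util_cfs(),
which is roughly (a paraphrased sketch, not the exact kernel source):

    static inline unsigned long cpu_util_cfs(struct rq *rq)
    {
            unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

            if (sched_feat(UTIL_EST))
                    util = max_t(unsigned long, util,
                                 READ_ONCE(rq->cfs.avg.util_est.enqueued));

            /*
             * If util_est.enqueued still contains the dequeued task's
             * contribution, the max() above keeps the request high.
             */
            return util;
    }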


> >
> > If I would run with your patch cpu_util_cfs() would chose between 597 and 0
> > whereas without it does between 597 and 601.
> >
> > Do you have a specific use case in mind? Or even test results showing a 
> > benefit
> > of your patch?
> >
> > > Dietmar Eggemann  于2020年12月11日周五 下午7:30写道:
> > >
> > >> Hi Yan,
> > >>
> > >> On 09/12/2020 11:44, Xuewen Yan wrote:
> > >>> when a task dequeued, it will update it's util, and cfs_rq_util_change
> > >>> would check rq's util, if the cfs_rq->avg.util_est.enqueued is bigger
> > >>> than  cfs_rq->avg.util_avg, but because the 
> > >>> cfs_rq->avg.util_est.enqueued
> > >>> didn't be decreased, this would cause bigger cfs_rq_util by mistake,
> > >>> as a result, cfs_rq_util_change may change freq unreasonablely.
> > >>>
> > >>> separate the util_est_dequeue() into util_est_dequeue() and
> > >>> util_est_update(), and dequeue the _task_util_est(p) before update util.
> > >>
> > >> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
> > >> cpu_util_cfs() operates on an old  cfs_rq->avg.util_est.enqueued value?
> > >>
> > >> cpu_util_cfs()
> > >>
> > >> if (sched_feat(UTIL_EST))
> > >&

[PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change

2020-12-09 Thread Xuewen Yan
When a task is dequeued, its util is updated and cfs_rq_util_change()
checks the rq's util. If cfs_rq->avg.util_est.enqueued is bigger than
cfs_rq->avg.util_avg, then because cfs_rq->avg.util_est.enqueued has not
been decreased yet, a too-large cfs_rq util is used by mistake; as a
result, cfs_rq_util_change() may change the frequency unreasonably.

Separate util_est_dequeue() into util_est_dequeue() and
util_est_update(), and subtract _task_util_est(p) before the util is
updated.

Signed-off-by: Xuewen Yan 
---
 kernel/sched/fair.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..20ecfd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3946,11 +3946,9 @@ static inline bool within_margin(int value, int margin)
 }
 
 static void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
 {
-   long last_ewma_diff;
struct util_est ue;
-   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
@@ -3961,6 +3959,17 @@ static inline bool within_margin(int value, int margin)
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
 
trace_sched_util_est_cfs_tp(cfs_rq);
+}
+
+static void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+{
+   long last_ewma_diff;
+   struct util_est ue;
+   int cpu;
+
+   if (!sched_feat(UTIL_EST))
+   return;
 
/*
 * Skip update of task's estimated utilization when the task has not
@@ -4085,7 +4094,10 @@ static inline int newidle_balance(struct rq *rq, struct 
rq_flags *rf)
 util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 
 static inline void
-util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
 bool task_sleep) {}
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
@@ -5589,6 +5601,8 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
 
+   util_est_dequeue(&rq->cfs, p);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@@ -5639,7 +5653,7 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
rq->next_balance = jiffies;
 
 dequeue_throttle:
-   util_est_dequeue(&rq->cfs, p, task_sleep);
+   util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
 }
 
-- 
1.9.1



[PATCH v3] sched: revise the initial value of the util_avg.

2020-11-05 Thread Xuewen Yan
According to the original code logic:

                  cfs_rq->avg.util_avg
  sa->util_avg = ---------------------- * se->load.weight
                  cfs_rq->avg.load_avg

but for fair_sched_class on a 64-bit platform:

  se->load.weight = 1024 * sched_prio_to_weight[prio];

so the ratio cfs_rq->avg.util_avg / cfs_rq->avg.load_avg must be extremely
small, and the judgment condition "sa->util_avg < cap" is easily satisfied.
It's not fair to those tasks that have a smaller nice value.

Signed-off-by: Xuewen Yan 
---
changes since V2:

*kernel/sched/fair.c | 6 +-
* 1 file changed, 5 insertions(+), 1 deletion(-)
*
*diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
*index 290f9e3..079760b 100644
*--- a/kernel/sched/fair.c
*+++ b/kernel/sched/fair.c
*@@ -794,7 +794,11 @@ void post_init_entity_util_avg(struct task_struct *p)
*
*if (cap > 0) {
*if (cfs_rq->avg.util_avg != 0) {
*-   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
*+   if (p->sched_class == &fair_sched_class)
*+   sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
*+   else
*+   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
*+
*sa->util_avg /= (cfs_rq->avg.load_avg + 1);
*
*if (sa->util_avg > cap)
*
---
comment from Vincent Guittot :
>
> According to the original code logic:
>
>                   cfs_rq->avg.util_avg
>   sa->util_avg = ---------------------- * se->load.weight
>                   cfs_rq->avg.load_avg

this should have been scale_load_down(se->load.weight) from the beginning

> but for fair_sched_class:
> se->load.weight = 1024 * sched_prio_to_weight[prio];

This is only true for 64-bit platforms; otherwise scale_load() and
scale_load_down() are no-ops.

> so the ratio cfs_rq->avg.util_avg / cfs_rq->avg.load_avg must be extremely
> small, and the judgment condition "sa->util_avg < cap" is easily satisfied.
> It's not fair to those tasks that have a smaller nice value.
>
> Signed-off-by: Xuewen Yan 
> ---
>  kernel/sched/fair.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 290f9e3..079760b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -794,7 +794,11 @@ void post_init_entity_util_avg(struct task_struct *p)
>
> if (cap > 0) {
> if (cfs_rq->avg.util_avg != 0) {

We should now use cpu_util() instead of cfs_rq->avg.util_avg which
takes into account other classes

> -   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> +   if (p->sched_class == &fair_sched_class)
> +   sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
> +   else
> +   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;

Why does this else keep using se->load.weight?

Either we use sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
for all classes

Or we want a different init value for other classes. But in this case
se->load.weight is meaningless and we should simply set them to 0
although we could probably compute a value based on bandwidth for
deadline class.
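For a concrete feel of the scale difference (illustration only; this assumes
the nice-0 weight of 1024 and the 64-bit SCHED_FIXEDPOINT_SHIFT of 10):

    #define SCHED_FIXEDPOINT_SHIFT  10
    #define NICE_0_WEIGHT           1024    /* sched_prio_to_weight[20] */

    /* What the two candidate multipliers evaluate to for a nice-0 task: */
    unsigned long w_scaled_up   = NICE_0_WEIGHT << SCHED_FIXEDPOINT_SHIFT; /* se->load.weight == 1048576 */
    unsigned long w_scaled_down = NICE_0_WEIGHT;                           /* se_weight(se)   == 1024 */

So on 64-bit, multiplying by se->load.weight uses a factor 2^10 larger than
the scaled-down weight that cfs_rq->avg.load_avg is accumulated with.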

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 290f9e3..c6186cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -794,7 +794,7 @@ void post_init_entity_util_avg(struct task_struct *p)
 
if (cap > 0) {
if (cfs_rq->avg.util_avg != 0) {
-   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
+   sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
sa->util_avg /= (cfs_rq->avg.load_avg + 1);
 
if (sa->util_avg > cap)
-- 
1.9.1



Re: [PATCH v2] sched: revise the initial value of the util_avg.

2020-11-05 Thread Xuewen Yan
Hi Vincent,

At first, I thought "sa->util_avg = cfs_rq->avg.util_avg * se_weight(se)" should
apply to all classes, but I was not sure how this modification would affect
the other classes, so I added this cfs condition.

As you said, keeping se->load.weight in the else branch is meaningless.

Thanks!



From: Vincent Guittot
Sent: November 5, 2020, 20:30
To: Xuewen Yan (闫学文)
Cc: Ingo Molnar; Peter Zijlstra; Juri Lelli; Dietmar Eggemann; Steven Rostedt;
Ben Segall; Mel Gorman; Daniel Bristot de Oliveira; linux-kernel;
xuewen.ya...@gmail.com
Subject: Re: [PATCH v2] sched: revise the initial value of the util_avg.

On Thu, 5 Nov 2020 at 12:22, Xuewen Yan  wrote:
>
> According to the original code logic:
>
>                   cfs_rq->avg.util_avg
>   sa->util_avg = ---------------------- * se->load.weight
>                   cfs_rq->avg.load_avg

this should have been scale_load_down(se->load.weight) from the beginning

> but for fair_sched_class:
> se->load.weight = 1024 * sched_prio_to_weight[prio];

This is only true for 64-bit platforms; otherwise scale_load() and
scale_load_down() are no-ops.

> so the ratio cfs_rq->avg.util_avg / cfs_rq->avg.load_avg must be extremely
> small, and the judgment condition "sa->util_avg < cap" is easily satisfied.
> It's not fair to those tasks that have a smaller nice value.
>
> Signed-off-by: Xuewen Yan 
> ---
>  kernel/sched/fair.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 290f9e3..079760b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -794,7 +794,11 @@ void post_init_entity_util_avg(struct task_struct *p)
>
> if (cap > 0) {
> if (cfs_rq->avg.util_avg != 0) {

We should now use cpu_util() instead of cfs_rq->avg.util_avg which
takes into account other classes

> -   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> +   if (p->sched_class == &fair_sched_class)
> +   sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
> +   else
> +   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;

Why does this else keep using se->load.weight?

Either we use sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
for all classes

Or we want a different init value for other classes. But in this case
se->load.weight is meaningless and we should simply set them to 0
although we could probably compute a value based on bandwidth for
deadline class

> +
> sa->util_avg /= (cfs_rq->avg.load_avg + 1);
>
> if (sa->util_avg > cap)
> --
> 1.9.1
>
> 

[PATCH v2] sched: revise the initial value of the util_avg.

2020-11-05 Thread Xuewen Yan
According to the original code logic:

                  cfs_rq->avg.util_avg
  sa->util_avg = ---------------------- * se->load.weight
                  cfs_rq->avg.load_avg

but for fair_sched_class:

  se->load.weight = 1024 * sched_prio_to_weight[prio];

so the ratio cfs_rq->avg.util_avg / cfs_rq->avg.load_avg must be extremely
small, and the judgment condition "sa->util_avg < cap" is easily satisfied.
It's not fair to those tasks that have a smaller nice value.

Signed-off-by: Xuewen Yan 
---
 kernel/sched/fair.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 290f9e3..079760b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -794,7 +794,11 @@ void post_init_entity_util_avg(struct task_struct *p)

if (cap > 0) {
if (cfs_rq->avg.util_avg != 0) {
-   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
+   if (p->sched_class == &fair_sched_class)
+   sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
+   else
+   sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
+
sa->util_avg /= (cfs_rq->avg.load_avg + 1);

if (sa->util_avg > cap)
--
1.9.1




[PATCH v2] sched: sched_domain fix highest_flag_domain function

2020-10-26 Thread Xuewen Yan
highest_flag_domain() searches for the highest sched_domain containing
the given flag. But if a lower sched_domain does not contain the flag
while a higher sched_domain does, the function returns NULL instead of
the higher sched_domain.

For example:
In the MC domain : no SD_ASYM_CPUCAPACITY flag;
In the DIE domain : the SD_ASYM_CPUCAPACITY flag is set;
then "highest_flag_domain(cpu, SD_ASYM_CPUCAPACITY)" will return NULL.

Signed-off-by: Xuewen Yan 
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6..2c7c566 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1427,7 +1427,7 @@ static inline struct sched_domain 
*highest_flag_domain(int cpu, int flag)
 
for_each_domain(cpu, sd) {
if (!(sd->flags & flag))
-   break;
+   continue;
hsd = sd;
}
 
-- 
1.9.1