On Fri, 26 Mar 2021 at 20:19, Rik van Riel <r...@surriel.com> wrote:
>
> On Mon, 22 Mar 2021 11:03:06 +0000
> Mel Gorman <mgor...@techsingularity.net> wrote:
>
> > Second, select_idle_smt() does not use the cpus mask so consider moving
> > the cpus initialisation after select_idle_smt() has been called.
> > Specifically this initialisation
> >
> >         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > Alternatively, clear the bits in the SMT sibling scan to avoid checking
> > the siblings twice. It's a tradeoff because initialising and clearing
> > bits is not free and the cost is wasted if a sibling is free.
>
> I tried a number of different variations on moving the CPU mask
> initialization, and clearing CPUs from the mask, and failed to
> get any clear results from those in testing, even in workloads
> with lots of context switches.
>
> Below is a simple version that seems to perform identically to
> more complicated versions :)
>
> ---8<---
> sched,fair: bring back select_idle_smt, but differently
>
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurts a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are back
> down to what they were before. The p95 and p99 response times for the
> memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <r...@surriel.com>
> ---
>  kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 52 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..69680158963f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>         return -1;
>  }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +       int cpu;
> +
> +       if (!static_branch_likely(&sched_smt_present))
> +               return -1;
> +
> +       for_each_cpu(cpu, cpu_smt_mask(target)) {
> +               if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> +                   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> +                       continue;
> +               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +                       return cpu;
> +       }
> +
> +       return -1;
> +}
> +
>  #else /* CONFIG_SCHED_SMT */
>
>  static inline void set_idle_cores(int cpu, int val)
> @@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
>         return __select_idle_cpu(core);
>  }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +       return -1;
> +}
> +
>  #endif /* CONFIG_SCHED_SMT */
>
>  /*
> @@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>   * average idle time for this rq (as found in rq->avg_idle).
>   */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> -       if (sched_feat(SIS_PROP) && !smt) {
> -               u64 avg_cost, avg_idle, span_avg;
> +       if (!smt) {
> +               if (cpus_share_cache(prev, target)) {
Have you checked the impact on non-SMT systems? It would be worth a static branch.

Also, this doesn't need to be in select_idle_cpu(), which aims to loop over
the sched_domain, because you only compare target and prev. So you can move
this call to select_idle_smt() into select_idle_sibling(), as part of all the
other tests done on prev/target/recent_used_cpu before calling
select_idle_cpu(), and even save a useless
sd = rcu_dereference(per_cpu(sd_llc, target));

> +                       /* No idle core. Check if prev has an idle sibling. */
> +                       i = select_idle_smt(p, sd, prev);
> +                       if ((unsigned int)i < nr_cpumask_bits)
> +                               return i;
> +               }
>
> -               /*
> -                * Due to large variance we need a large fuzz factor;
> -                * hackbench in particularly is sensitive here.
> -                */
> -               avg_idle = this_rq()->avg_idle / 512;
> -               avg_cost = this_sd->avg_scan_cost + 1;
> +               if (sched_feat(SIS_PROP)) {
> +                       u64 avg_cost, avg_idle, span_avg;
>
> -               span_avg = sd->span_weight * avg_idle;
> -               if (span_avg > 4*avg_cost)
> -                       nr = div_u64(span_avg, avg_cost);
> -               else
> -                       nr = 4;
> +                       /*
> +                        * Due to large variance we need a large fuzz factor;
> +                        * hackbench in particularly is sensitive here.
> +                        */
> +                       avg_idle = this_rq()->avg_idle / 512;
> +                       avg_cost = this_sd->avg_scan_cost + 1;
>
> -               time = cpu_clock(this);
> +                       span_avg = sd->span_weight * avg_idle;
> +                       if (span_avg > 4*avg_cost)
> +                               nr = div_u64(span_avg, avg_cost);
> +                       else
> +                               nr = 4;
> +
> +                       time = cpu_clock(this);
> +               }
>         }
>
>         for_each_cpu_wrap(cpu, cpus, target) {
> @@ -6307,7 +6343,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         if (!sd)
>                 return target;
>
> -       i = select_idle_cpu(p, sd, target);
> +       i = select_idle_cpu(p, sd, prev, target);
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> --
> 2.25.4
>
>
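To illustrate what I mean, a rough sketch only (untested, not the posted
patch): the select_idle_smt() call sits in select_idle_sibling(), next to
the other prev/target fast-path tests, so select_idle_cpu() keeps its
original signature. sched_smt_active() (a wrapper around the
sched_smt_present static key) and test_idle_cores() are the existing
helpers in include/linux/sched/smt.h and kernel/sched/fair.c; the exact
placement here is an assumption.

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
        /* ... earlier target/prev/recent_used_cpu checks unchanged ... */

        sd = rcu_dereference(per_cpu(sd_llc, target));
        if (!sd)
                return target;

        /*
         * Sketch: when SMT is active and the LLC has no idle core, try
         * the SMT sibling(s) of prev before scanning the whole domain.
         * sched_smt_active() already hides behind a static branch, which
         * also covers the non-SMT concern above.
         */
        if (sched_smt_active() && !test_idle_cores(target, false) &&
            cpus_share_cache(prev, target)) {
                i = select_idle_smt(p, sd, prev);
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;
        }

        i = select_idle_cpu(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        return target;
}

And if select_idle_smt() also dropped its sd argument (the SMT siblings of
prev already share prev's cache), the check could even run before the
sd_llc lookup, which is the "useless" rcu_dereference() mentioned above.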