Hi Tao,

On Tue, 30 Jun 2020 at 11:41, Tao Zhou <ouwen...@hotmail.com> wrote:
>
> Hi,
>
> On Tue, Jun 30, 2020 at 09:43:11AM +0200, Vincent Guittot wrote:
> > Hi Tao,
> >
> > On Monday, 15 June 2020 at 16:14:01 (+0800), Xing Zhengjun wrote:
> > >
> > > On 6/15/2020 1:18 PM, Tao Zhou wrote:
> > > > ...
> > >
> > > I applied the patch based on v5.7; the regression still exists.
> >
> > Could you try the patch below? This patch is not a real fix because it
> > impacts the performance of other benchmarks, but it will at least narrow
> > down your problem.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9f78eb76f6fb..a4d8614b1854 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  	 * and consider staying local.
> >  	 */
> >
> > -	if ((sd->flags & SD_NUMA) &&
> > -	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > -		return NULL;
> > +//	if ((sd->flags & SD_NUMA) &&
> > +//	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > +//		return NULL;
>
> This just narrows it down to the fork (wakeup) path that may be causing
> the problem, /me thinks.
The perf regression seems to be fixed by this patch on my setup. According to
the statistics that I have on the use case, the groups are overloaded but the
load is quite low, and this low level hits this NUMA-specific condition.

> Some days ago, I tried this patch:
>
> https://lore.kernel.org/lkml/20200616164801.18644-1-peter.pu...@linaro.org/
>
> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..abcbdf80ee75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
>  	case group_has_spare:
>  		/* Select group with most idle CPUs */
> -		if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
> +		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
>  			return false;
> +
> +		/* Select group with lowest group_util */
> +		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
> +		    idlest_sgs->group_util <= sgs->group_util)
> +			return false;
> +
>  		break;
>  	}
>
> --
>
> This patch is related to the wakeup slow path, and the group type is full_busy.

I tried it but haven't seen any impact on the mmapfork test results.

> What I tried that got improved:
>
> $> sysbench threads --threads=16 run
>
> The total number of events (higher is better):
>
> v5.8-rc1      : 34020 34494 33561
> v5.8-rc1+patch: 35466 36184 36260
>
> $> perf bench -f simple sched pipe -l 4000000
>
> v5.8-rc1      : 16.203 16.238 16.150
> v5.8-rc1+patch: 15.757 15.930 15.819
>
> I also saw some regressions on other workloads (I don't know much about them).
> So I suggest testing this patch against this stress-ng.mmapfork case. I
> haven't done that yet.
>
> Another patch I want to mention here is this one (merged into v5.7 now):
>
> commit 68f7b5cc83 ("sched/cfs: change initial value of runnable_avg")
>
> And this regression happened based on v5.7. That patch is related to the fork
> wakeup path of the overloaded type, so it absolutely needs to be tried.
> Finally, I have a patch that seems unrelated to the fork wakeup path,
> but I also tried it on some benchmarks and did not see any improvement
> there. I give this changed patch here anyway (I realized that the
> full_busy type prefers an idle CPU first, but I'm not sure). Maybe
> there is no need to try it.
>
> Index: core.bak/kernel/sched/fair.c
> ===================================================================
> --- core.bak.orig/kernel/sched/fair.c
> +++ core.bak/kernel/sched/fair.c
> @@ -9226,17 +9226,20 @@ static struct sched_group *find_busiest_
>  		goto out_balanced;
>
>  	if (busiest->group_weight > 1 &&
> -	    local->idle_cpus <= (busiest->idle_cpus + 1))
> -		/*
> -		 * If the busiest group is not overloaded
> -		 * and there is no imbalance between this and busiest
> -		 * group wrt idle CPUs, it is balanced. The imbalance
> -		 * becomes significant if the diff is greater than 1
> -		 * otherwise we might end up to just move the imbalance
> -		 * on another group. Of course this applies only if
> -		 * there is more than 1 CPU per group.
> -		 */
> -		goto out_balanced;
> +	    local->idle_cpus <= (busiest->idle_cpus + 1)) {
> +		if (local->group_type == group_has_spare) {
> +			/*
> +			 * If the busiest group is not overloaded
> +			 * and there is no imbalance between this and busiest
> +			 * group wrt idle CPUs, it is balanced. The imbalance
> +			 * becomes significant if the diff is greater than 1
> +			 * otherwise we might end up to just move the imbalance
> +			 * on another group. Of course this applies only if
> +			 * there is more than 1 CPU per group.
> +			 */
> +			goto out_balanced;
> +		}
> +	}
>
>  	if (busiest->sum_h_nr_running == 1)
>  		/*
>
> TBH, I don't know much about the numbers below.
>
> Thank you for the help!
>
> Thanks.
> >  	/*
> >  	 * If the local group is less loaded than the selected
> >
> > --
> >
> > > =========================================================================================
> > > tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:
> > >   lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb000038
> > >
> > > commit:
> > >   e94f80f6c49020008e6fa0f3d4b806b8595d17d8
> > >   6c8116c914b65be5e4d6f66d69c8142eb0648c22
> > >   v5.7
> > >   c7e6d37f60da32f808140b1b7dabcc3cde73c4cc (Tao's patch)
> > >
> > > e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                        v5.7 c7e6d37f60da32f808140b1b7da
> > > ---------------- --------------------------- --------------------------- ---------------------------
> > >       %stddev     %change    %stddev     %change    %stddev     %change    %stddev
> > >           \          |                \          |                \          |                \
> > >     819250 ±  5%     -10.1%     736616 ±  8%     +41.2%    1156877 ±  3%     +43.6%    1176246 ±  5%  stress-ng.futex.ops
> > >     818985 ±  5%     -10.1%     736460 ±  8%     +41.2%    1156215 ±  3%     +43.6%    1176055 ±  5%  stress-ng.futex.ops_per_sec
> > >       1551 ±  3%      -3.4%       1498 ±  5%      -4.6%       1480 ±  5%     -14.3%       1329 ± 11%  stress-ng.inotify.ops
> > >       1547 ±  3%      -3.5%       1492 ±  5%      -4.8%       1472 ±  5%     -14.3%       1326 ± 11%  stress-ng.inotify.ops_per_sec
> > >      11292 ±  8%      -2.8%      10974 ±  8%      -9.4%      10225 ±  6%     -10.1%      10146 ±  6%  stress-ng.kill.ops
> > >      11317 ±  8%      -2.6%      11023 ±  8%      -9.1%      10285 ±  5%     -10.3%      10154 ±  6%  stress-ng.kill.ops_per_sec
> > >      28.20 ±  4%     -35.4%      18.22            -33.4%      18.77            -27.7%      20.40 ±  9%  stress-ng.mmapfork.ops_per_sec
> > >    2999012 ± 21%     -10.1%    2696954 ± 22%     -88.5%     344447 ± 11%     -87.8%     364932        stress-ng.tee.ops_per_sec
> > >       7882 ±  3%      -5.4%       7458 ±  4%      -2.0%       7724 ±  3%      -2.2%       7709 ±  4%  stress-ng.vforkmany.ops
> > >       7804 ±  3%      -5.2%       7400 ±  4%      -2.0%       7647 ±  3%      -2.1%       7636 ±  4%  stress-ng.vforkmany.ops_per_sec
> > >   46745421 ±  3%      -8.1%   42938569 ±  3%      -5.2%   44312072 ±  4%      -2.3%   45648193        stress-ng.yield.ops
> > >   46734472 ±  3%      -8.1%   42926316 ±  3%      -5.2%   44290338 ±  4%      -2.4%   45627571        stress-ng.yield.ops_per_sec
> > >
> > > ...
> > >
> > > --
> > > Zhengjun Xing