Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression
On Fri, 10 Jul 2020 at 14:08, Tao Zhou wrote:
>
> Hi Vincent,
>
> Sorry for such a late reply.
>
> On Tue, Jun 30, 2020 at 04:22:10PM +0200, Vincent Guittot wrote:
> > Hi Tao,
> >
> > On Tue, 30 Jun 2020 at 11:41, Tao Zhou wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Jun 30, 2020 at 09:43:11AM +0200, Vincent Guittot wrote:
> > > > Hi Tao,
> > > >
> > > > Le lundi 15 juin 2020 à 16:14:01 (+0800), Xing Zhengjun a écrit :
> > > > >
> > > > > On 6/15/2020 1:18 PM, Tao Zhou wrote:
> > > >
> > > > ...
> > > >
> > > > > I apply the patch based on v5.7, the regression still existed.
> > > >
> > > > Could you try the patch below ? This patch is not a real fix because
> > > > it impacts performance of other benchmarks but it will at least
> > > > narrow your problem.
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 9f78eb76f6fb..a4d8614b1854 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > > >  	 * and consider staying local.
> > > >  	 */
> > > >
> > > > -	if ((sd->flags & SD_NUMA) &&
> > > > -	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > > > -		return NULL;
> > > > +//	if ((sd->flags & SD_NUMA) &&
> > > > +//	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > > > +//		return NULL;
> > >
> > > Just narrowing to the fork (wakeup) path that may be leading to the
> > > problem, /me thinks.
> >
> > The perf regression seems to be fixed with this patch on my setup.
> > According to the statistics that I have on the use case, groups are
> > overloaded but load is quite low and this low level hits this NUMA
> > specific condition.
>
> My box has 1 socket, 4 cores, 2 threads per core, i.e. 2x4 CPUs.
> (x86_64 Intel(R) Core(TM) i7-6700HQ)

The change above only applies to NUMA systems, which doesn't seem to be
the case for your setup.

> stress-ng.mmapfork
>
> v5.8-rc4:
>
> stress-ng: info: [7158] dispatching hogs: 8 mmapfork
> stress-ng: info: [7158] successful run completed in 1.09s
> stress-ng: info: [7158] stressor   bogo ops  real time  usr time  sys time  bogo ops/s  bogo ops/s
> stress-ng: info: [7158]                        (secs)     (secs)    (secs)  (real time) (usr+sys time)
> stress-ng: info: [7158] mmapfork         32      1.09       2.48      6.01       29.36        3.77
> stress-ng: info: [7158] for a 1.09s run time:
> stress-ng: info: [7158]    8.73s available CPU time
> stress-ng: info: [7158]    2.52s user time   ( 28.86%)
> stress-ng: info: [7158]    6.07s system time ( 69.52%)
> stress-ng: info: [7158]    8.59s total time  ( 98.38%)
> stress-ng: info: [7158] load average: 0.52 0.26 0.10
>
> v5.8-rc4 w/ above patch:
>
> stress-ng: info: [5126] dispatching hogs: 8 mmapfork
> stress-ng: info: [5126] successful run completed in 1.07s
> stress-ng: info: [5126] stressor   bogo ops  real time  usr time  sys time  bogo ops/s  bogo ops/s
> stress-ng: info: [5126]                        (secs)     (secs)    (secs)  (real time) (usr+sys time)
> stress-ng: info: [5126] mmapfork         32      1.07       2.45      5.96       29.88        3.80
> stress-ng: info: [5126] for a 1.07s run time:
> stress-ng: info: [5126]    8.58s available CPU time
> stress-ng: info: [5126]    2.49s user time   ( 29.02%)
> stress-ng: info: [5126]    6.02s system time ( 70.17%)
> stress-ng: info: [5126]    8.51s total time  ( 99.19%)
> stress-ng: info: [5126] load average: 0.31 0.22 0.09
>
> No obvious changes.

Yeah, the problem happens on systems with several NUMA nodes.

> And I traced and also tried to find the task_h_load = 0 case after the
> patch you sent.
>
> I used the command:
>
> trace-cmd record -e sched -e irq -e cpu_idle -e cpu_frequency -e timer \
>     cgexec -g cpu:A \
>     stress-ng --timeout 1 --times --verify --metrics-brief --sequential 8 \
>     --class scheduler --exclude (all excluded but mmapfork)
>
>   <...>-26132 [000]  6571.361156: bprint: task_h_load: cfs_rq->h_load:119, p->load_avg:26, cfs_rq->load_avg:14487
>   <...>-26132 [000]  6571.361156: bprint: load_balance: detach_task migrate_load: task_h_load original: 0
>
> If the cgroup hierarchy has three levels (first tried), I can not find
> the task_h_load = 0 case.

Your system is not large enough to face the problem.

> A group se's weight is related to the task_group's shares; a task's
> weight is its own weight.
>
> 	if (!tg->parent) {
> 		load = cpu_rq(cpu)->load.weight;
> 	} else {
> 		load = tg->parent->cfs_rq[cpu]->h_load;
> 		load *= tg->cfs_rq[cpu]->shares;
> 		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
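To make the traced zero concrete: task_h_load() scales the task's
load_avg by its cfs_rq's hierarchical load and divides by the cfs_rq's
total load_avg, all in integer arithmetic, so a small task on a heavily
loaded group truncates to 0. Below is a stand-alone sketch using exactly
the numbers from the bprint trace above (plain C, not the kernel code;
only the final formula mirrors task_h_load() in kernel/sched/fair.c):

#include <stdio.h>

int main(void)
{
	/* Values taken from the trace above. */
	unsigned long h_load   = 119;	/* cfs_rq->h_load          */
	unsigned long load_avg = 26;	/* p->se.avg.load_avg      */
	unsigned long cfs_load = 14487;	/* cfs_rq_load_avg(cfs_rq) */

	/*
	 * task_h_load(p) computes
	 *   div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
	 *            cfs_rq_load_avg(cfs_rq) + 1)
	 */
	printf("task_h_load = %lu\n", load_avg * h_load / (cfs_load + 1));

	/* 119 * 26 = 3094 < 14488, so the division truncates to 0 and
	 * detach_tasks() sees a task whose migration looks free. */
	return 0;
}

Each extra cgroup level shrinks h_load further (it is rescaled by shares
against the parent's weight at every level), which is consistent with
the zero not showing up on the three-level hierarchy tried above.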
Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression
Hi Tao,

On Tue, 30 Jun 2020 at 11:41, Tao Zhou wrote:
>
> Hi,
>
> On Tue, Jun 30, 2020 at 09:43:11AM +0200, Vincent Guittot wrote:
> > Hi Tao,
> >
> > Le lundi 15 juin 2020 à 16:14:01 (+0800), Xing Zhengjun a écrit :
> > >
> > > On 6/15/2020 1:18 PM, Tao Zhou wrote:
> >
> > ...
> >
> > > I apply the patch based on v5.7, the regression still existed.
> >
> > Could you try the patch below ? This patch is not a real fix because it
> > impacts performance of other benchmarks but it will at least narrow
> > your problem.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9f78eb76f6fb..a4d8614b1854 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  	 * and consider staying local.
> >  	 */
> >
> > -	if ((sd->flags & SD_NUMA) &&
> > -	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > -		return NULL;
> > +//	if ((sd->flags & SD_NUMA) &&
> > +//	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > +//		return NULL;
>
> Just narrowing to the fork (wakeup) path that may be leading to the
> problem, /me thinks.

The perf regression seems to be fixed with this patch on my setup.
According to the statistics that I have on the use case, groups are
overloaded but load is quite low and this low level hits this NUMA
specific condition.

> Some days ago, I tried this patch:
>
> https://lore.kernel.org/lkml/20200616164801.18644-1-peter.pu...@linaro.org/
>
> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..abcbdf80ee75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
>  	case group_has_spare:
>  		/* Select group with most idle CPUs */
> -		if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
> +		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
>  			return false;
> +
> +		/* Select group with lowest group_util */
> +		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
> +		    idlest_sgs->group_util <= sgs->group_util)
> +			return false;
> +
>  		break;
>  	}
>
> --
>
> This patch is related to the wakeup slow path, where the group type is
> full_busy.

I tried it but haven't seen any impact on the mmapfork test results.

> What I tried that got improved:
>
> $> sysbench threads --threads=16 run
>
> The total number of events (higher is better):
>
> v5.8-rc1      : 34020 34494 33561
> v5.8-rc1+patch: 35466 36184 36260
>
> $> perf bench -f simple sched pipe -l 400
>
> v5.8-rc1      : 16.203 16.238 16.150
> v5.8-rc1+patch: 15.757 15.930 15.819
>
> I also saw some regressions in other workloads (I don't know much about
> them). So I suggest testing this patch against stress-ng.mmapfork; I
> didn't do this yet.
>
> Another patch I want to mention here is this (merged into v5.7 now):
>
> commit 68f7b5cc83 ("sched/cfs: change initial value of runnable_avg")
>
> And this regression happened based on v5.7. This patch is related to the
> fork wakeup path for the overloaded type.

Absolutely need to try it then.

> Finally, I have given a patch that seems not related to the fork wakeup
> path; I also tried it on some benchmarks but did not see improvement
> there. I give this changed patch here (I realized that the full_busy
> type prefers idle CPUs first, but I am not sure). Maybe it does not need
> to be tried.
>
> Index: core.bak/kernel/sched/fair.c
> ===================================================================
> --- core.bak.orig/kernel/sched/fair.c
> +++ core.bak/kernel/sched/fair.c
> @@ -9226,17 +9226,20 @@ static struct sched_group *find_busiest_
>  		goto out_balanced;
>
>  	if (busiest->group_weight > 1 &&
> -	    local->idle_cpus <= (busiest->idle_cpus + 1))
> -		/*
> -		 * If the busiest group is not overloaded
> -		 * and there is no imbalance between this and busiest
> -		 * group wrt idle CPUs, it is balanced. The imbalance
> -		 * becomes significant if the diff is greater than 1
> -		 * otherwise we might end up to just move the imbalance
> -		 * on another group. Of course this applies only if
> -		 * there is more than 1 CPU per group.
> -		 */
> -		goto out_balanced;
> +	    local->idle_cpus <= (busiest->idle_cpus + 1)) {
> +		if (local->group_type == group_has_spare) {
> +
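For reference, the tie-break that Tao's update_pick_idlest() hunk above
introduces, pulled out into a stand-alone sketch (hypothetical C: the
struct and function names are invented, only the two conditions mirror
the patch):

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the two fields the hunk compares. */
struct stats {
	unsigned int idle_cpus;
	unsigned long group_util;
};

/* group_has_spare case: does the candidate replace the current pick? */
static bool candidate_wins(const struct stats *idlest,
			   const struct stats *cand)
{
	/* Strictly more idle CPUs in the current pick: keep it. */
	if (idlest->idle_cpus > cand->idle_cpus)
		return false;
	/* New tie-break: on equal idle CPUs, keep the current pick
	 * unless the candidate is strictly less utilized. */
	if (idlest->idle_cpus == cand->idle_cpus &&
	    idlest->group_util <= cand->group_util)
		return false;
	return true;
}

int main(void)
{
	struct stats idlest = { .idle_cpus = 4, .group_util = 900 };
	struct stats cand   = { .idle_cpus = 4, .group_util = 300 };

	printf("candidate wins: %d\n", candidate_wins(&idlest, &cand));
	return 0;
}

Before the change, the ">=" test meant a candidate tied on idle CPUs
always lost; after it, ties go to the less utilized group, spreading
forked tasks a little wider across groups with spare capacity.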
Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression
Hi Tao,

Le lundi 15 juin 2020 à 16:14:01 (+0800), Xing Zhengjun a écrit :
>
> On 6/15/2020 1:18 PM, Tao Zhou wrote:

...

> I apply the patch based on v5.7, the regression still existed.

Could you try the patch below ? This patch is not a real fix because it
impacts performance of other benchmarks but it will at least narrow your
problem.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f78eb76f6fb..a4d8614b1854 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	 * and consider staying local.
 	 */

-	if ((sd->flags & SD_NUMA) &&
-	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
-		return NULL;
+//	if ((sd->flags & SD_NUMA) &&
+//	    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
+//		return NULL;

 	/*
 	 * If the local group is less loaded than the selected
--

> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:
>
> lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb38
>
> commit:
>   e94f80f6c49020008e6fa0f3d4b806b8595d17d8
>   6c8116c914b65be5e4d6f66d69c8142eb0648c22
>   v5.7
>   c7e6d37f60da32f808140b1b7dabcc3cde73c4cc (Tao's patch)
>
> e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                        v5.7 c7e6d37f60da32f808140b1b7da
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>     819250 ± 5%      -10.1%     736616 ± 8%      +41.2%    1156877 ± 3%      +43.6%    1176246 ± 5%   stress-ng.futex.ops
>     818985 ± 5%      -10.1%     736460 ± 8%      +41.2%    1156215 ± 3%      +43.6%    1176055 ± 5%   stress-ng.futex.ops_per_sec
>       1551 ± 3%       -3.4%       1498 ± 5%       -4.6%       1480 ± 5%      -14.3%       1329 ± 11%  stress-ng.inotify.ops
>       1547 ± 3%       -3.5%       1492 ± 5%       -4.8%       1472 ± 5%      -14.3%       1326 ± 11%  stress-ng.inotify.ops_per_sec
>      11292 ± 8%       -2.8%      10974 ± 8%       -9.4%      10225 ± 6%      -10.1%      10146 ± 6%   stress-ng.kill.ops
>      11317 ± 8%       -2.6%      11023 ± 8%       -9.1%      10285 ± 5%      -10.3%      10154 ± 6%   stress-ng.kill.ops_per_sec
>      28.20 ± 4%      -35.4%      18.22           -33.4%      18.77           -27.7%      20.40 ± 9%   stress-ng.mmapfork.ops_per_sec
>    2999012 ± 21%     -10.1%    2696954 ± 22%     -88.5%         37 ± 11%     -87.8%     364932        stress-ng.tee.ops_per_sec
>       7882 ± 3%       -5.4%       7458 ± 4%       -2.0%       7724 ± 3%       -2.2%       7709 ± 4%   stress-ng.vforkmany.ops
>       7804 ± 3%       -5.2%       7400 ± 4%       -2.0%       7647 ± 3%       -2.1%       7636 ± 4%   stress-ng.vforkmany.ops_per_sec
>   46745421 ± 3%       -8.1%   42938569 ± 3%       -5.2%   44312072 ± 4%       -2.3%   45648193        stress-ng.yield.ops
>   46734472 ± 3%       -8.1%   42926316 ± 3%       -5.2%   44290338 ± 4%       -2.4%   45627571        stress-ng.yield.ops_per_sec
>
> ...
> --
> Zhengjun Xing
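For context on why the commented-out test matters: in v5.7,
find_idlest_group() derives the allowed imbalance from NICE_0_LOAD and
sd->imbalance_pct rather than from the observed loads, so when a fork
storm keeps avg_load small the fixed allowance dominates the comparison.
A back-of-the-envelope sketch in C; the imbalance_pct of 125 is an
assumed typical value for a NUMA level and the two loads are invented:

#include <stdio.h>

int main(void)
{
	/*
	 * v5.7 find_idlest_group():
	 *   imbalance = scale_load_down(NICE_0_LOAD) *
	 *               (sd->imbalance_pct - 100) / 100;
	 * with scale_load_down(NICE_0_LOAD) == 1024.
	 */
	unsigned long imbalance = 1024 * (125 - 100) / 100;	/* 256 */

	/* Invented loads for a short-lived fork burst. */
	unsigned long local_avg_load  = 300;
	unsigned long idlest_avg_load = 60;

	/* The NUMA condition keeps the task local even though the
	 * remote group carries a fifth of the load. */
	if (idlest_avg_load + imbalance >= local_avg_load)
		printf("stay local: %lu + %lu >= %lu\n",
		       idlest_avg_load, imbalance, local_avg_load);
	return 0;
}

Commenting the test out lets such forks land on the remote, idler node,
which fits mmapfork recovering while benchmarks that benefit from NUMA
locality pay for the extra cross-node placements.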
Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression
On 6/15/2020 1:18 PM, Tao Zhou wrote:
> Hi,
>
> On Fri, Jun 12, 2020 at 03:59:31PM +0800, Xing Zhengjun wrote:
> > Hi,
> >
> > I test the regression, it still existed in v5.7. If you have any fix
> > for it, please send it to me, I can verify it. Thanks.
>
> When the busiest group is group_fully_busy and the local group is
> <= group_fully_busy, the metric used is:
>
>   local group         busiest group       metric used
>   group_fully_busy    group_fully_busy    avg load
>   group_has_spare     group_fully_busy    idle cpu/task num
>
> In find_busiest_group(), consider this condition:
>
>   if (busiest->group_type != group_overloaded) {
>
> In this case the busiest type is group_fully_busy and the local type is
> <= group_fully_busy. In this branch, it checks idle CPUs and task counts
> and can go to out_balanced; that is to say, it treats group_fully_busy
> the same as group_has_spare (this case is done in calculate_imbalance()).
> When the local group and the busiest group are both group_fully_busy,
> avg load needs to be used as the metric (in calculate_imbalance()).
> So I give the change below:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cbcb2f71599b..0afbea39dd5a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9219,24 +9219,26 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		 */
>  		goto out_balanced;
>
> -	if (busiest->group_weight > 1 &&
> -	    local->idle_cpus <= (busiest->idle_cpus + 1))
> -		/*
> -		 * If the busiest group is not overloaded
> -		 * and there is no imbalance between this and busiest
> -		 * group wrt idle CPUs, it is balanced. The imbalance
> -		 * becomes significant if the diff is greater than 1
> -		 * otherwise we might end up to just move the imbalance
> -		 * on another group. Of course this applies only if
> -		 * there is more than 1 CPU per group.
> -		 */
> -		goto out_balanced;
> +	if (local->group_type == group_has_spare) {
> +		if (busiest->group_weight > 1 &&
> +		    local->idle_cpus <= (busiest->idle_cpus + 1))
> +			/*
> +			 * If the busiest group is not overloaded
> +			 * and there is no imbalance between this and busiest
> +			 * group wrt idle CPUs, it is balanced. The imbalance
> +			 * becomes significant if the diff is greater than 1
> +			 * otherwise we might end up to just move the imbalance
> +			 * on another group. Of course this applies only if
> +			 * there is more than 1 CPU per group.
> +			 */
> +			goto out_balanced;
>
> -	if (busiest->sum_h_nr_running == 1)
> -		/*
> -		 * busiest doesn't have any tasks waiting to run
> -		 */
> -		goto out_balanced;
> +		if (busiest->sum_h_nr_running == 1)
> +			/*
> +			 * busiest doesn't have any tasks waiting to run
> +			 */
> +			goto out_balanced;
> +	}
>  }
>
>  force_balance:
>
> In fact, I don't know whether this change can help or whether it is even
> right. No test, no compile. If it is wrong, just ignore it.
>
> Thanks

I apply the patch based on v5.7, the regression still existed.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:

lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb38

commit:
  e94f80f6c49020008e6fa0f3d4b806b8595d17d8
  6c8116c914b65be5e4d6f66d69c8142eb0648c22
  v5.7
  c7e6d37f60da32f808140b1b7dabcc3cde73c4cc (Tao's patch)

e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                        v5.7 c7e6d37f60da32f808140b1b7da
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
    819250 ± 5%      -10.1%     736616 ± 8%      +41.2%    1156877 ± 3%      +43.6%    1176246 ± 5%   stress-ng.futex.ops
    818985 ± 5%      -10.1%     736460 ± 8%      +41.2%    1156215 ± 3%      +43.6%    1176055 ± 5%   stress-ng.futex.ops_per_sec
      1551 ± 3%       -3.4%       1498 ±
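Condensed from Tao's diff above, the decision it changes, as a
stand-alone sketch (hypothetical C: the struct and names are invented,
only the conditions mirror the patch):

#include <stdbool.h>
#include <stdio.h>

enum group_type { group_has_spare, group_fully_busy, group_overloaded };

struct sg_stats {
	enum group_type group_type;
	unsigned int group_weight;
	unsigned int idle_cpus;
	unsigned int sum_h_nr_running;
};

/* True when the patched code would jump to out_balanced. */
static bool out_balanced_shortcut(const struct sg_stats *local,
				  const struct sg_stats *busiest)
{
	/* Tao's guard: only trust the idle-CPU heuristics when the
	 * local group really has spare capacity. */
	if (local->group_type != group_has_spare)
		return false;

	if (busiest->group_weight > 1 &&
	    local->idle_cpus <= busiest->idle_cpus + 1)
		return true;

	if (busiest->sum_h_nr_running == 1)
		return true;

	return false;
}

int main(void)
{
	/* Two fully busy groups: before the patch this bailed out on
	 * idle-CPU counts; with the guard it falls through to
	 * force_balance so calculate_imbalance() compares avg_load. */
	struct sg_stats local   = { group_fully_busy, 4, 0, 4 };
	struct sg_stats busiest = { group_fully_busy, 4, 0, 4 };

	printf("out_balanced: %d\n", out_balanced_shortcut(&local, &busiest));
	return 0;
}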
Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression
Hi,

I test the regression, it still existed in v5.7. If you have any fix for
it, please send it to me, I can verify it. Thanks.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:

lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb38

commit:
  e94f80f6c49020008e6fa0f3d4b806b8595d17d8
  6c8116c914b65be5e4d6f66d69c8142eb0648c22
  v5.7-rc3
  v5.7

e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                    v5.7-rc3                        v5.7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
     21398 ± 7%       +6.5%      22781 ± 2%      -14.5%      18287 ± 4%       -5.5%      20231 ± 14%  stress-ng.clone.ops
    819250 ± 5%      -10.1%     736616 ± 8%      +34.2%    1099410 ± 5%      +41.2%    1156877 ± 3%   stress-ng.futex.ops
    818985 ± 5%      -10.1%     736460 ± 8%      +34.2%    1099487 ± 5%      +41.2%    1156215 ± 3%   stress-ng.futex.ops_per_sec
      1551 ± 3%       -3.4%       1498 ± 5%       -9.5%       1404 ± 2%       -4.6%       1480 ± 5%   stress-ng.inotify.ops
      1547 ± 3%       -3.5%       1492 ± 5%       -9.5%       1400 ± 2%       -4.8%       1472 ± 5%   stress-ng.inotify.ops_per_sec
     11292 ± 8%       -2.8%      10974 ± 8%       +1.9%      11505 ± 13%      -9.4%      10225 ± 6%   stress-ng.kill.ops
     28.20 ± 4%      -35.4%      18.22           -33.5%      18.75           -33.4%      18.77        stress-ng.mmapfork.ops_per_sec
   1932318            +1.5%    1961688 ± 2%     -22.8%    1492231 ± 2%       +4.0%    2010509 ± 3%    stress-ng.softlockup.ops
   1931679 ± 2%       +1.5%    1961143 ± 2%     -22.8%    1491939 ± 2%       +4.0%    2009585 ± 3%    stress-ng.softlockup.ops_per_sec
  18607406 ± 6%      -12.9%   16210450 ± 21%    -12.7%   16238693 ± 14%      -8.0%   17120880 ± 13%   stress-ng.switch.ops
  18604406 ± 6%      -12.9%   16208270 ± 21%    -12.7%   16237956 ± 14%      -8.0%   17115273 ± 13%   stress-ng.switch.ops_per_sec
   2999012 ± 21%     -10.1%    2696954 ± 22%     -9.1%    2725653 ± 21%     -88.5%         37 ± 11%   stress-ng.tee.ops_per_sec
      7882 ± 3%       -5.4%       7458 ± 4%       -4.0%       7566 ± 4%       -2.0%       7724 ± 3%   stress-ng.vforkmany.ops
      7804 ± 3%       -5.2%       7400 ± 4%       -3.8%       7504 ± 4%       -2.0%       7647 ± 3%   stress-ng.vforkmany.ops_per_sec
  46745421 ± 3%       -8.1%   42938569 ± 3%       -7.8%   43078233 ± 3%       -5.2%   44312072 ± 4%   stress-ng.yield.ops
  46734472 ± 3%       -8.1%   42926316 ± 3%       -7.8%   43067447 ± 3%       -5.2%   44290338 ± 4%   stress-ng.yield.ops_per_sec

On 4/27/2020 8:46 PM, Vincent Guittot wrote:
> On Mon, 27 Apr 2020 at 13:35, Hillf Danton wrote:
> > On Mon, 27 Apr 2020 11:03:58 +0200 Vincent Guittot wrote:
> > > On Sun, 26 Apr 2020 at 14:42, Hillf Danton wrote:
> > > > On 4/21/2020 8:47 AM, kernel test robot wrote:
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed a 56.4% improvement of
> > > > > stress-ng.fifo.ops_per_sec due to commit:
> > > > >
> > > > > commit: 6c8116c914b65be5e4d6f66d69c8142eb0648c22 ("sched/fair:
> > > > > Fix condition of avg_load calculation")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > in testcase: stress-ng
> > > > > on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @
> > > > > 2.20GHz with 128G memory
> > > > > with following parameters:
> > > > >
> > > > >   nr_threads: 100%
> > > > >   disk: 1HDD
> > > > >   testtime: 1s
> > > > >   class: scheduler
> > > > >   cpufreq_governor: performance
> > > > >   ucode: 0xb38
> > > > >   sc_pid_max: 4194304
> > > >
> > > > We need to handle group_fully_busy in a different way from
> > > > group_overloaded as task push does not help grow load balance in
> > > > the former case.
> > >
> > > Have you tested this patch for the UC above ? Do you have figures ?
> >
> > No I am looking for a box of 88 threads. Likely to get access to it
> > in as early as three weeks.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8744,30 +8744,20 @@ find_idlest_group(struct sched_domain *s
 	switch (local_sgs.group_type) {
 	case group_overloaded:
-	case group_fully_busy:
-		/*
-		 * When comparing groups across NUMA domains, it's possible for
-		 * the local domain to be very lightly loaded relative to the
-		 * remote domains but "imbalance" skews the comparison making
-		 * remote CPUs look much more favourable. When considering
-		 * cross-domain, add imbalance to the load on the remote node
-		 * and consider staying local.
-		 */
-
-		if ((sd->flags & SD_NUMA) &&
-		    ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
+		i