On Thu, Feb 28, 2019 at 03:17:51PM +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -41.4% regression of pft.faults_per_sec_per_cpu due to commit:
>
> commit: 2c83362734dad8e48ccc0710b5cd2436a0323893 ("sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: pft
> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> with following parameters:
>
>     runtime: 300s
>     nr_task: 50%
>     cpufreq_governor: performance
>     ucode: 0xb00002e
>
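For anyone unfamiliar with the benchmark, a pft-style page fault test boils
down to roughly the following. This is a minimal sketch rather than the
actual gormanm/pft source; the thread count and mapping size are arbitrary
and the real benchmark has many more options.

/*
 * Minimal sketch of a pft-style page fault microbenchmark, not the
 * actual gormanm/pft source. Each thread maps anonymous memory and
 * touches every page; the reported metric is faults per second.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS 4                    /* arbitrary for illustration */
#define MAP_BYTES  (512UL << 20)        /* 512M of anonymous memory per thread */

static void *fault_worker(void *arg)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        char *mem = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        (void)arg;
        if (mem == MAP_FAILED)
                return NULL;

        /* Each touch is a minor fault that allocates and zeroes a page,
         * ideally on the node the thread is running on. */
        for (unsigned long off = 0; off < MAP_BYTES; off += pagesize)
                mem[off] = 1;

        munmap(mem, MAP_BYTES);
        return NULL;
}

int main(void)
{
        pthread_t threads[NR_THREADS];
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, fault_worker, NULL);
        for (int i = 0; i < NR_THREADS; i++)
                pthread_join(threads[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        double faults = (double)NR_THREADS * (MAP_BYTES / sysconf(_SC_PAGESIZE));

        printf("%.0f faults/sec total\n", faults / secs);
        return 0;
}

The faults/sec figure is dominated by how quickly pages can be allocated and
zeroed, which in turn depends on the memory bandwidth available to whichever
CPUs the threads were placed on.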
The headline regression looks high but it's also a known consequence for some
microbenchmarks, particularly those that are short-lived and consist of
non-communicating tasks. The impact of the patch is to favour starting a new
task on the local node unless the socket is saturated. This is to avoid a
pattern where a task clones a helper it communicates with and the helper
starts on a remote node. Starting remote negatively impacts basic workloads
like shell scripts, client/server workloads or pipelined tasks. The workloads
that benefit from spreading early are parallelised tasks that do not
communicate until the end of the task. PFT is an example of the latter. If
spread early, it maximises the total memory bandwidth of the machine early in
the lifetime of the workload. It would quickly recover if it ran long enough;
the early measurements are low because it saturates the bandwidth of the
local node. This configuration is at 50%, and the machine is likely to be
2-socket, so in all likelihood it has half the bandwidth available, hence the
41.4% regression (very close to half, so some tasks probably got
load-balanced). On to the other examples:

> test-description: Pft is the page fault test micro benchmark.
> test-url: https://github.com/gormanm/pft
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stream: |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | array_size=10000000 |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=25% |
> |                  | omp=true |
> |                  | ucode=0xb00002e |

STREAM is typically short-lived. Again, it benefits from spreading early to
maximise memory bandwidth, and 25% of threads would fit in one node. For
parallelised STREAM tests it's usually the case that the OpenMP directives
are used to bind one thread per memory channel to measure the total machine
memory bandwidth rather than using it as a scaling test (a rough sketch of
that sort of setup is further below). I'm guessing this machine doesn't have
22 memory channels, which is what would make nr_threads=25% a sensible
configuration.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | reaim: reaim.jobs_per_min 1.3% improvement |
> | test machine     | 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 256G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_job=3000 |
> |                  | nr_task=100% |
> |                  | runtime=300s |
> |                  | test=custom |
> |                  | ucode=0x3d |

reaim is generally a mess, so in this case it's unclear. The load is a mix of
task creation, IO operations, signals and others, and it might have
benefitted slightly from running local. One reason I don't particularly like
reaim is that historically it was dominated by sending/receiving signals. In
my own tests, signal is typically removed, as is its tendency to sync the
entire filesystem at high frequency.
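As an aside on the binding point above, when STREAM is used to measure the
total memory bandwidth of a machine the placement is normally taken out of
the scheduler's hands entirely. Below is a rough sketch of what that looks
like; it assumes a triad-style kernel and the standard OpenMP binding
environment variables, it is not the official STREAM source, and the array
size simply mirrors the array_size=10000000 figure from the report.

/*
 * Sketch of a STREAM-triad-style loop where placement is controlled by
 * OpenMP rather than left to the scheduler, e.g. run as:
 *   OMP_NUM_THREADS=<one per channel> OMP_PROC_BIND=spread OMP_PLACES=cores ./triad
 * Not the official STREAM source.
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000UL    /* mirrors array_size=10000000 in the report */

int main(void)
{
        double *a = malloc(N * sizeof(*a));
        double *b = malloc(N * sizeof(*b));
        double *c = malloc(N * sizeof(*c));
        const double scalar = 3.0;

        if (!a || !b || !c)
                return 1;

        /* First-touch initialisation so pages are allocated on the node of
         * the thread that will later use them. */
        #pragma omp parallel for
        for (unsigned long i = 0; i < N; i++) {
                b[i] = 1.0;
                c[i] = 2.0;
        }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (unsigned long i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];
        t = omp_get_wtime() - t;

        /* Triad touches three arrays of N doubles. */
        printf("triad bandwidth: %.1f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

        free(a); free(b); free(c);
        return 0;
}

With that sort of explicit binding the benchmark is insensitive to where the
scheduler would have started the threads, which is why the sensitivity only
shows up in configurations like the ones in this report, where nr_threads is
a percentage of the CPUs and, presumably, placement is left to the scheduler.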
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stream: stream.add_bandwidth_MBps -32.0% regression |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | array_size=10000000 |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=50% |
> |                  | omp=true |
> |                  | ucode=0xb00002e |

STREAM was covered already, other than noting that it's unlikely the machine
has 44 memory channels to work with, so any imbalance in the task
distribution should show up as a regression. Again, the patch favours using
the local node first, which would saturate the local memory channels earlier.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | plzip: |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_threads=100% |
> |                  | ucode=0xb00002e |

This doesn't state what change happened, be it positive or negative.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | reaim: reaim.jobs_per_min -11.9% regression |
> | test machine     | 192 threads Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz with 512G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_task=100% |
> |                  | runtime=300s |
> |                  | test=all_utime |
> |                  | ucode=0xb00002e |

This is completely user-space bound, running basic math operations. It's not
clear why it would suffer *but* if hyperthreading is enabled, the patch might
mean that hyperthread siblings were used earlier due to favouring the local
node.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | hackbench: hackbench.throughput -7.3% regression |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | ipc=pipe |
> |                  | mode=process |
> |                  | nr_threads=1600% |
> |                  | ucode=0xb00002e |

Hackbench is very short-lived, but the workload also saturates the machine so
heavily that it would be hard to tell from this report whether the 7.3% is
statistically significant or not (see the pipe ping-pong sketch further down
for the pattern it generates). The patch might mean a socket is severely
over-saturated in the very early phases of the workload.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | reaim: reaim.std_dev_percent 11.4% undefined |
> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_task=100% |
> |                  | runtime=300s |
> |                  | test=custom |
> |                  | ucode=0x200004d |

I'm not sure what the change is saying. Possibly that it's less variable.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | reaim: boot-time.boot 95.3% regression |
> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_task=100% |
> |                  | runtime=300s |
> |                  | test=alltests |
> |                  | ucode=0x200004d |

boot-time.boot?
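On the hackbench point, the communication pattern in question is essentially
a pile of freshly forked process pairs ping-ponging small messages over
pipes, something like the sketch below. It is a toy illustration rather than
the real hackbench, which uses groups of senders and receivers; the pair and
message counts here are made up.

/*
 * Toy sketch of the hackbench-style pattern: freshly forked pairs of
 * processes that immediately ping-pong small messages over pipes.
 * Not the real hackbench source; sizes and counts are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_PAIRS 32
#define NR_MSGS  10000

int main(void)
{
        for (int p = 0; p < NR_PAIRS; p++) {
                int to_child[2], to_parent[2];

                if (pipe(to_child) || pipe(to_parent)) {
                        perror("pipe");
                        exit(1);
                }

                if (fork() == 0) {      /* receiver: echo each message back */
                        char buf[100];

                        for (int i = 0; i < NR_MSGS; i++) {
                                read(to_child[0], buf, sizeof(buf));
                                write(to_parent[1], buf, sizeof(buf));
                        }
                        _exit(0);
                }

                if (fork() == 0) {      /* sender: drive the ping-pong */
                        char buf[100] = { 0 };

                        for (int i = 0; i < NR_MSGS; i++) {
                                write(to_child[1], buf, sizeof(buf));
                                read(to_parent[0], buf, sizeof(buf));
                        }
                        _exit(0);
                }

                /* Parent only forks the pairs; the children hold the fds. */
                close(to_child[0]); close(to_child[1]);
                close(to_parent[0]); close(to_parent[1]);
        }

        while (wait(NULL) > 0)
                ;
        return 0;
}

Whether both ends of a pipe start on the same node, and how quickly a single
node fills up when the machine is overcommitted at nr_threads=1600%, is what
the initial placement policy influences here.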
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.7% regression |
> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_task=50% |
> |                  | runtime=300s |
> |                  | ucode=0x200004d |

PFT was already discussed.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stream: stream.add_bandwidth_MBps -28.8% regression |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | array_size=50000000 |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=50% |
> |                  | omp=true |
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stream: stream.add_bandwidth_MBps -30.6% regression |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | array_size=10000000 |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=50% |
> |                  | omp=true |
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | pft: pft.faults_per_sec_per_cpu -42.5% regression |
> | test machine     | 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | nr_task=50% |
> |                  | runtime=300s |

Already discussed.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | reaim: reaim.child_systime -1.4% undefined |
> | test machine     | 144 threads Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
> | test parameters  | cpufreq_governor=performance |
> |                  | iterations=30 |
> |                  | nr_task=1600% |
> |                  | test=compute |

A 1.4% change in system time could be overhead in the fork phase as it looks
for local idle cores and then remote idle cores early on, but the difference
is tiny.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.fifo.ops_per_sec 76.2% improvement |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | class=pipe |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=100% |
> |                  | testtime=1s |

A case where short-lived communicating tasks benefit from starting local.

> +------------------+--------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.tsearch.ops_per_sec -17.1% regression |
> | test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 128G memory |
> | test parameters  | class=cpu |
> |                  | cpufreq_governor=performance |
> |                  | nr_threads=100% |
> |                  | testtime=1s |
> +------------------+--------------------------------------------------------------------------+
>

Given full machine utilisation and a 1 second duration, it's a case where
saturating the local node early was sub-optimal and 1 second is not enough
time for load balancing or other factors to correct it.

Bottom line, the patch is a trade-off but, from a range of tests, I found
that on balance we benefit more from having tasks start local until there is
evidence that the kernel is justified in spreading the load to remote nodes.
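To make the bottom line concrete, the behaviour being argued for amounts to
something like the toy model below. It is a deliberately simplified,
self-contained sketch of the idea rather than the actual change to the
find_idlest_group() path; the load model and the notion of "saturated" are
made up for illustration.

/*
 * Toy model of the placement trade-off: prefer the local node until it is
 * saturated, only then spread to a remote node. Not the actual sched/fair
 * implementation; the threshold and load accounting are invented.
 */
#include <stdio.h>

#define NR_NODES        2
#define CPUS_PER_NODE   44      /* roughly the 88-thread 2-socket box above */

static int node_load[NR_NODES]; /* running tasks per node */

/* Treat a node as saturated once every CPU on it has a task. */
static int node_saturated(int nid)
{
        return node_load[nid] >= CPUS_PER_NODE;
}

static int select_start_node(int local)
{
        /* Prefer the local node so a task that immediately talks to its
         * parent (shell pipelines, client/server pairs) stays close... */
        if (!node_saturated(local))
                return local;

        /* ...and only spread to the remote node once the local socket is
         * full. Bandwidth-hungry workloads like pft and STREAM pay for
         * this early in their lifetime. */
        return (local + 1) % NR_NODES;
}

int main(void)
{
        /* Start 44 tasks (nr_task=50% of an 88-thread machine) from node 0
         * and show that the local-first policy never touches node 1. */
        for (int i = 0; i < 44; i++)
                node_load[select_start_node(0)]++;

        printf("node0=%d node1=%d\n", node_load[0], node_load[1]);
        return 0;
}

It prints node0=44 node1=0, which is the halved-memory-bandwidth situation
described for pft above: a long enough run would be corrected by load
balancing, but a short measurement window never sees that correction.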
-- 
Mel Gorman
SUSE Labs