Hello, on a NUMA-enabled server equipped with 4 Intel E5-4610 v2 CPUs we observe the following performance degradation:
Runtime of the "lu.C.x" test from the NAS Parallel Benchmarks after booting the kernel:

  real 1m57.834s
  user 113m51.520s

Then we disable and re-enable one core:

  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu1/online

and rerun the same test. Using all 64 cores, the runtime is now degraded (by 40% for user time and by 30% for real (wall-clock) time):

  real 2m47.746s
  user 160m46.109s

The issue was first reported in "The Linux Scheduler: a Decade of Wasted Cores" paper:
  http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
  https://github.com/jplozi/wastedcores/issues/1

How to reproduce the issue:

A) Get the benchmark and compile it:
   1) wget http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
   2) tar zxvf NPB3.3.1.tar.gz
   3) cd ~/NPB3.3.1/NPB3.3-OMP/config/
   4) ln -sf NAS.samples/make.def.gcc_x86 make.def   (assuming the gcc compiler)
   5) ln -sf NAS.samples/suite.def.lu suite.def
   6) cd ~/NPB3.3.1/NPB3.3-OMP
   7) make suite
   8) You should now have the lu.* benchmarks in ~/NPB3.3.1/NPB3.3-OMP/bin.
      The binaries are sorted alphabetically by runtime, with "lu.A.x" having
      the shortest runtime.

B) Reproducing the issue (see also the attached script)

Remark: we ran the tests with autogroup disabled,

  sysctl -w kernel.sched_autogroup_enabled=0

to avoid this issue on the 4.7 kernel:
  https://bugzilla.kernel.org/show_bug.cgi?id=120481

The test was conducted on a NUMA server with 4 nodes, using all 64 available cores.

1) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_before_reenable_kernel.sched_autogroup_enabled=0
2) Disable and re-enable one core:
   echo 0 > /sys/devices/system/cpu/cpu1/online
   echo 1 > /sys/devices/system/cpu/cpu1/online
3) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_after_reenable_kernel.sched_autogroup_enabled=0

Then compare the logs:

  grep "real\|user" *lu.C*

You will see a significant difference in both real and user time.
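To diff the grepped times numerically rather than by eye, a small helper can convert the `XmY.ZZZs` values printed by bash's `time` into plain seconds. This is a sketch; the `to_seconds` name is mine and not part of the benchmark or the attached script:

```shell
# Convert a bash `time` value like "2m47.746s" into seconds, so the
# before/after log entries can be compared numerically.
# Assumes the minutes+seconds format shown in the logs above.
to_seconds() {
    echo "$1" | awk -Fm '{ sub(/s$/, "", $2); print $1 * 60 + $2 }'
}

to_seconds 1m57.834s   # real time after boot      -> 117.834
to_seconds 2m47.746s   # real time after re-enable -> 167.746
```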
According to the authors of the paper, the root cause of the problem is a missing call to regenerate the scheduling domains inside NUMA nodes after re-enabling a CPU. The problem was introduced in the 3.19 kernel. The authors of the paper have proposed a patch which applies to the 4.1 kernel. Here is the link:
  https://github.com/jplozi/wastedcores/blob/master/patches/missing_sched_domains_linux_4.1.patch

===========For completeness, here are the results with the 4.6 kernel===========

AFTER BOOT
  real 1m31.639s
  user 89m24.657s

AFTER the core has been disabled and re-enabled
  real 2m44.566s
  user 157m59.814s

Please notice that with the 4.6 kernel the problem is much more visible than with the 4.7-rc5 kernel. At the same time, the 4.6 kernel delivers much better performance after boot than the 4.7-rc5 kernel, which might indicate that another problem is in play.
=================================================================

I have also tested the kernel provided by Peter Zijlstra on Friday, June 24th, which provides a fix for https://bugzilla.kernel.org/show_bug.cgi?id=120481. It does not fix this issue, and right after boot it performs worse than the 4.6 kernel right after boot, so we may in fact be facing two problems here.

========Results with 4.7.0-02548776ded1185e6e16ad0a475481e982741ee9 kernel=====
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
$ git rev-parse HEAD
02548776ded1185e6e16ad0a475481e982741ee9

AFTER BOOT
  real 1m58.549s
  user 113m31.448s

AFTER the core has been disabled and re-enabled
  real 2m35.930s
  user 148m20.795s
=================================================================

Thanks a lot!
Jirka

PS: I have opened this BZ to track the issue:
Bug 121121 - Kernel v4.7-rc5 - performance degradation up to 40% after disabling and re-enabling a core
https://bugzilla.kernel.org/show_bug.cgi?id=121121
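If the diagnosis is right, the NUMA-level domain should disappear from the affected CPU's scheduling-domain hierarchy after the offline/online cycle. On kernels built with CONFIG_SCHED_DEBUG=y this can be inspected under /proc/sys/kernel/sched_domain. The helper below is my own sketch (the `list_sched_domains` name and its parameters are not from the report); the base directory is a parameter only so the function is easy to exercise against a copied tree:

```shell
# Print each scheduler-domain level (and its name) for one CPU by walking
# the sched_domain sysctl tree (available with CONFIG_SCHED_DEBUG=y).
# Run before and after the offline/online cycle and compare: per the
# paper's diagnosis, the topmost (NUMA) level should be missing afterwards.
list_sched_domains() {
    base="${1:-/proc/sys/kernel/sched_domain}"   # parameterized for testing
    cpu="${2:-cpu1}"
    for d in "$base/$cpu"/domain*; do
        [ -d "$d" ] || continue
        printf '%s %s\n' "${d##*/}" "$(cat "$d/name")"
    done
}
```

Typical usage is simply `list_sched_domains` on the live system, which prints one line per domain level such as "domain0 MC" up through the NUMA level.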
reproduce.sh
Description: Bourne shell script
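The attachment itself is not inlined here; as a rough sketch, a script covering step B could look like the following. This is my reconstruction from the steps above, not the actual reproduce.sh: it must run as root, and setting DRY_RUN=1 makes it print the privileged commands instead of executing them (note the report uses bash's `|&`, which is spelled `2>&1 |` here for portability):

```shell
#!/bin/sh
# Sketch of a reproduce script for step B (a reconstruction, not the real
# attachment). Requires root; DRY_RUN=1 prints commands instead of running them.
run() {
    # Execute the given command line, or just echo it in dry-run mode.
    if [ -n "$DRY_RUN" ]; then echo "+ $*"; else sh -c "$*"; fi
}

reproduce() {
    BENCH="${BENCH:-bin/lu.C.x}"   # benchmark binary built in step A
    CPU="${CPU:-1}"                # core to offline/online
    run "sysctl -w kernel.sched_autogroup_enabled=0"
    run "(time $BENCH) 2>&1 | tee \$(uname -r)_lu.C.x.log_before_reenable"
    run "echo 0 > /sys/devices/system/cpu/cpu$CPU/online"
    run "echo 1 > /sys/devices/system/cpu/cpu$CPU/online"
    run "(time $BENCH) 2>&1 | tee \$(uname -r)_lu.C.x.log_after_reenable"
    grep "real\|user" *lu.C* 2>/dev/null || true
}
```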