Hi,

These patches extend the work done by Anton (https://patchwork.ozlabs.org/patch/537621/) and need to be applied on top of his patches.

The goal of these patches is to rework how the 'math' registers (FP, VEC and VSX) are context switched. Currently the kernel adopts a lazy approach, always switching userspace tasks with all three facilities disabled and loading each set of registers only when the corresponding unavailable exception is taken. The kernel does try to avoid disabling the facilities in the syscall quick path, but during testing it appears that even what should be a simple syscall still causes the kernel to use some facilities (vectorised memcpy, for example) for itself and therefore disable them for the user task.

The lazy approach minimises the time spent restoring userspace state, and if tasks don't use any of these facilities it is the correct thing to do. In recent years, new workloads and new features such as auto vectorisation in GCC have meant that the use of these facilities by userspace has increased, so much so that some workloads can have a task take an FP unavailable exception and a VEC unavailable exception almost every time slice.

This series removes the general laziness in favour of a more selective approach. If a task uses any of the 'math' facilities, the kernel will load the registers and enable the facilities for future time slices, on the assumption that the use is likely to continue for some time. This removes the cost of having to take an exception.

These patches also add logic to detect whether a task had been using a facility and optimise the case where the registers are still hot. This provides another speedup, as not only is the cost of the exception saved but the cost of copying up to 64 x 128 bit registers is also removed.
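
To make the difference between the two policies concrete, here is a toy userspace model in C. It is only a sketch of the idea and not the code from these patches: the toy_task structure, the FAC_* flags and all function names are made up for illustration. It simply counts how many unavailable exceptions a task would take, and it treats each facility independently (the VSX special case is ignored here).

```c
#include <stdbool.h>

/* Hypothetical flags for this sketch; these are not the kernel's MSR bits. */
#define FAC_FP  0x1u
#define FAC_VEC 0x2u
#define FAC_VSX 0x4u

struct toy_task {
	unsigned int used;     /* facilities the task has ever touched */
	unsigned int enabled;  /* facilities enabled in the current time slice */
	unsigned int traps;    /* unavailable exceptions taken so far */
};

/* Lazy policy: every time slice starts with all facilities disabled. */
static void slice_start_lazy(struct toy_task *t)
{
	t->enabled = 0;
}

/* Selective policy: re-enable whatever the task has used before. */
static void slice_start_selective(struct toy_task *t)
{
	t->enabled = t->used;
}

/* The task touches one facility; a disabled facility costs one exception. */
static void touch(struct toy_task *t, unsigned int fac)
{
	if (!(t->enabled & fac)) {
		t->traps++;
		t->enabled |= fac;
	}
	t->used |= fac;
}

/*
 * Run a task that touches every facility in fac_mask once per time slice
 * and return the number of unavailable exceptions it took.
 */
unsigned int count_traps(bool selective, unsigned int fac_mask, int slices)
{
	struct toy_task t = { 0, 0, 0 };

	for (int s = 0; s < slices; s++) {
		if (selective)
			slice_start_selective(&t);
		else
			slice_start_lazy(&t);

		for (int bit = 0; bit < 3; bit++) {
			unsigned int fac = 1u << bit;

			if (fac_mask & fac)
				touch(&t, fac);
		}
	}
	return t.traps;
}
```

In this model a task that touches FP and VEC in each of 10 time slices takes 20 exceptions under the lazy policy and 2 under the selective one, since only the first slice traps.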

With these patches applied on top of Anton's patches I observe a significant improvement with Anton's context switch microbenchmark using yield():

http://ozlabs.org/~anton/junkcode/context_switch2.c

Using an LE kernel compiled with pseries_le_defconfig.

Running:
  ./context_switch2 --test=yield 8 8
and adding one of --fp, --altivec or --vector gives a 5% improvement on a POWER8 CPU.

  ./context_switch2 --test=yield --fp --altivec --vector 8 8
gives a 15% improvement on a POWER8 CPU.

I'll take this opportunity to note that 15% can be somewhat misleading. It may be reasonable to assume that each of the optimisations has had a compounding effect. This isn't incorrect, and the reason behind the apparent compounding reveals a lot about where the current bottleneck is. The tests always touch FP first, then VEC, then VSX, which is the guaranteed worst case for the way the kernel currently operates: this behaviour triggers three successive unavailable exceptions. Since the kernel currently enables all three facilities after taking a VSX unavailable exception, the tests can be modified to touch the facilities in the order VSX->VEC->FP, in which case the difference in performance when touching all three is only 5%. There is a compounding effect in so far as the cost of taking multiple unavailable exceptions is removed. This testing also demonstrates that the exception itself is by far the most expensive part of the current lazy approach.
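
The ordering effect described above can be shown with a small counting model in C. Again this is a hypothetical sketch: the enum and the function name are mine, not kernel symbols. The only behaviour it encodes is the one stated above, that a VSX unavailable exception enables all three facilities, plus the assumption that FP and VEC unavailable exceptions each enable only their own facility.

```c
/* Hypothetical facility identifiers for this sketch only. */
enum math_fac { MF_FP, MF_VEC, MF_VSX };

/*
 * Count the unavailable exceptions taken in one time slice under the
 * current lazy scheme, given the order in which a task touches the
 * facilities. All facilities start out disabled.
 */
int traps_for_order(const enum math_fac *order, int n)
{
	unsigned int enabled = 0;
	int traps = 0;

	for (int i = 0; i < n; i++) {
		unsigned int bit = 1u << order[i];

		if (enabled & bit)
			continue;	/* already enabled, no exception */

		traps++;
		if (order[i] == MF_VSX)
			/* VSX unavailable enables FP, VEC and VSX together */
			enabled = (1u << MF_FP) | (1u << MF_VEC) | (1u << MF_VSX);
		else
			enabled |= bit;
	}
	return traps;
}
```

Called with the order FP, VEC, VSX this returns 3, while the order VSX, VEC, FP returns 1, which matches the worst case and the modified test described above.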

Cyril Bur (8):
  selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  selftests/powerpc: Test preservation of FPU and VMX regs across preemption
  selftests/powerpc: Test FPU and VMX regs in signal ucontext
  powerpc: Explicitly disable math features when copying thread
  powerpc: Restore FPU/VEC/VSX if previously used
  powerpc: Add the ability to save FPU without giving it up
  powerpc: Add the ability to save Altivec without giving it up
  powerpc: Add the ability to save VSX without giving it up

 arch/powerpc/include/asm/processor.h               |   2 +
 arch/powerpc/include/asm/switch_to.h               |   5 +-
 arch/powerpc/kernel/asm-offsets.c                  |   2 +
 arch/powerpc/kernel/entry_64.S                     |  55 +++++-
 arch/powerpc/kernel/fpu.S                          |  25 +--
 arch/powerpc/kernel/ppc_ksyms.c                    |   4 -
 arch/powerpc/kernel/process.c                      | 144 ++++++++++++--
 arch/powerpc/kernel/vector.S                       |  45 +----
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/math/Makefile      |  19 ++
 tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 185 +++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/fpu_signal.c  | 119 +++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 ++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 219 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/vmx_signal.c  | 124 ++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 ++++++++
 19 files changed, 1240 insertions(+), 81 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

-- 
2.6.2