Hi,
I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
A typical case is the sum kernel [1]. The tight loop below can easily be
optimized with SIMD.
    for (int64_t i = 0; i < length; i++) {
      local.sum += values[i];
    }
The compiler already vectorizes this loop, but it does so at compile time,
without knowledge of the target CPU.
Binaries compiled with AVX-512 cannot run on older CPUs, while binaries
compiled with only SSE4 enabled are suboptimal on newer hardware.
I have some proposals and would like to hear comments from the community.
- Based on our experience with the ISA-L [2] project (an optimized storage
acceleration library for x86 and Arm), a runtime dispatcher is a good approach.
Basically, it links in code optimized for different CPU features (SSE4, AVX2,
NEON, ...) and selects the best one that fits the target CPU at first
invocation. This is similar to gcc indirect functions [3], but doesn't depend
on the compiler.
- Use gcc FMV (function multiversioning) [4] to generate multiple versions of
one function. See the sample source and compiled code [5].
Though it looks simple, it has many limitations: it's a gcc-specific feature
with no support from clang or msvc, and it only works on x86, with no Arm
support.
I think this approach is a no-go.
- Don't do it.
Gandiva leverages the LLVM JIT for runtime code optimization. Is it duplicated
effort to do this in the C++ kernels? Will these vectorizable computations move
to Gandiva in the future?
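
To make the first proposal (runtime dispatcher) more concrete, here is a
minimal sketch of the idea. It is not taken from ISA-L or Arrow: all names
(SumFn, SumBaseline, SumAvx2, ResolveSum, Sum) are hypothetical, the AVX2
variant is only built when a gcc-compatible compiler targets x86, and it
relies on the compiler's __builtin_cpu_supports for CPU detection, which is
itself an assumption a compiler-independent dispatcher would replace with its
own CPUID code.

```cpp
#include <cstdint>

// Hypothetical names throughout; this is not Arrow or ISA-L API.
using SumFn = int64_t (*)(const int64_t*, int64_t);

// Baseline variant: built with the default target flags, runs everywhere.
static int64_t SumBaseline(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) {
    sum += values[i];
  }
  return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_VARIANT 1
// Same loop, but the compiler is allowed to use AVX2 instructions when it
// vectorizes this one function. Must only be called on CPUs with AVX2.
static __attribute__((target("avx2")))
int64_t SumAvx2(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) {
    sum += values[i];
  }
  return sum;
}
#endif

// Picks the best variant the running CPU supports.
static SumFn ResolveSum() {
#ifdef HAVE_AVX2_VARIANT
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return SumAvx2;
  }
#endif
  return SumBaseline;
}

// Public entry point. The function-local static is initialized once, on the
// first invocation (thread-safe since C++11); later calls only pay for an
// indirect call.
int64_t Sum(const int64_t* values, int64_t length) {
  static const SumFn impl = ResolveSum();
  return impl(values, length);
}
```

Callers only ever see Sum(); adding a NEON or AVX-512 variant would mean one
more function and one more check in ResolveSum().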
[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
[2] https://github.com/intel/isa-l
[3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
[4] https://lwn.net/Articles/691932/
[5] https://godbolt.org/z/ajpuq_