I did a quick investigation of libsimdpp, Google Highway [1], and Vc (std-simd) [2].
I tried rewriting the SIMD UTF-8 validation code by unifying the SSE and NEON intrinsics with libsimdpp wrappers; you can compare the simdpp/sse/neon versions [3]. UTF-8 validation is non-trivial, but the porting was straightforward, easier than I expected. The wrappers also replace cryptic names like `_mm_shuffle_epi8` and `vqtbl1q_u8` with friendlier ones such as `simdpp::permute_bytes16` or `hwy::TableLookupBytes`.

In the end, though, I was blocked by a gap between NEON and SSE4. NEON `tbl` can look up multiple concatenated tables and returns 0 for out-of-range indices [4]. SSE4 `pshufb` looks up a single table, and its handling of out-of-range indices is more convoluted: only an index with the high bit set yields 0, otherwise the low four bits are used [5]. (A small scalar model of the two semantics is sketched below.) In this specific code the NEON behaviour is convenient; to unify the code I would have to give up that NEON feature and sacrifice performance on ARM. It would of course also be possible to improve libsimdpp/Highway, but I think the issue is common to all SIMD wrappers: they can unify most code when the vector length is the same, yet there are always cases where architecture-dependent code is necessary, such as the example above or advanced features like AVX-512 masks, which wrappers do not support well.

About performance: as described in Google Highway's design philosophy [6], it achieves portability, maintainability and readability by sacrificing 10-20% performance. That sounds fair. libsimdpp and Highway look like mature products; both claim to support gcc/clang/msvc, C++11, and x86/ARM/PPC. std-simd only supports gcc 9+, and its ARM/PPC support is currently poor.

That said, I don't think adopting a SIMD wrapper alone will fix our problem, especially code size, both source and binary: they are essentially thin wrappers over intrinsics with a friendlier API. Personally, I would rather apply SIMD only to subroutines (e.g. the sum loop), not the whole kernel. It is simpler, though we need to keep the ifdefs from exploding. A SIMD kernel shares a lot of code with the base kernel, and moving that shared code into an xxx_internal.h header makes the base kernel harder to read. Besides, as Wes commented [7], it is better to put the SIMD code in a standalone shared library. Binary size may still grow quickly: a simple sum operation already needs #{i8,i16,i32,i64,float,double} * #{sse,avx,avx512 | neon,sve} code instances, although a wrapper's carefully designed templates may help reduce the source size. (A rough sketch of that layout is also included below.)
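To make the tbl/pshufb gap concrete, here is a minimal scalar model of the two lookup semantics. This is plain C++ written for illustration, not the actual ported code; it only models the single-table NEON form (the multi-table variants just extend the valid index range).

#include <array>
#include <cstdint>
#include <cstdio>

using Vec16 = std::array<uint8_t, 16>;

// NEON vqtbl1q_u8 semantics: any index >= 16 yields 0.
Vec16 NeonTblModel(const Vec16& table, const Vec16& idx) {
  Vec16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
  return out;
}

// SSE _mm_shuffle_epi8 (pshufb) semantics: only an index byte with the
// high bit set yields 0; otherwise the low 4 bits select, so 16..127 wrap.
Vec16 SsePshufbModel(const Vec16& table, const Vec16& idx) {
  Vec16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
  return out;
}

int main() {
  Vec16 table{};
  for (int i = 0; i < 16; ++i) table[i] = static_cast<uint8_t>(0xA0 + i);
  Vec16 idx = {0, 15, 16, 31, 0x7F, 0x80, 0xFF};  // remaining lanes are zero
  Vec16 n = NeonTblModel(table, idx);
  Vec16 s = SsePshufbModel(table, idx);
  for (int i = 0; i < 7; ++i)
    std::printf("idx=0x%02X  neon=0x%02X  sse=0x%02X\n",
                (unsigned)idx[i], (unsigned)n[i], (unsigned)s[i]);
  // e.g. idx=16: NEON returns 0 while SSE wraps to table[0]; a portable
  // wrapper has to assume the weaker contract (or mask the indices),
  // which is where the ARM-side convenience is lost.
  return 0;
}

And a very rough sketch of the "SIMD only in small subroutines, per-ISA objects, runtime dispatch" layout I have in mind. All names here (SumLoop, ResolveSumInt64, the per-ISA .cc files) are hypothetical, not existing Arrow code:

#include <cstddef>
#include <cstdint>

// One templated inner loop; a SIMD wrapper (libsimdpp/Highway) or the
// auto-vectorizer turns this body into vector code, so the source is
// written once per operation.
template <typename T>
T SumLoop(const T* values, std::size_t n) {
  T sum = T(0);
  for (std::size_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

// The binary still ends up with #{i8,i16,i32,i64,float,double} instances
// per ISA: each per-ISA translation unit (sum_sse.cc, sum_avx2.cc,
// sum_neon.cc, ...) explicitly instantiates the same template under
// different compiler flags.
template int64_t SumLoop<int64_t>(const int64_t*, std::size_t);
template double SumLoop<double>(const double*, std::size_t);
// ... and so on for the other element types.

// The base kernel only sees a function pointer chosen once at startup
// (cpuid / getauxval), so no #ifdef leaks into the kernel itself.
using SumInt64Fn = int64_t (*)(const int64_t*, std::size_t);
SumInt64Fn ResolveSumInt64();  // hypothetical per-ISA resolver, defined elsewhere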
[1] https://github.com/google/highway
[2] https://github.com/VcDevel/std-simd
[3] simdpp: https://github.com/cyb70289/utf8/blob/simdpp/range-simdpp.cc
    sse:    https://github.com/cyb70289/utf8/blob/simdpp/range-sse.c
    neon:   https://github.com/cyb70289/utf8/blob/simdpp/range-neon.c
[4] https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vqtbl2q_u8
[5] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_shuffle_epi8&expand=5153
[6] https://github.com/google/highway#design-philosophy
[7] https://github.com/apache/arrow/pull/7314#issuecomment-638972317

Yibo

On 6/9/20 5:34 PM, Antoine Pitrou wrote:

Hello,

As part of https://github.com/apache/arrow/pull/7314, a discussion started about our strategy for adding SIMD optimizations to various routines and kernels.

Currently, we have no defined strategy and we have been adding hand-written SIMD-optimized functions for particular primitives and instruction sets, thanks to the submissions of contributors. For example, the above PR adds ~500 lines of code for the purpose of accelerating the SUM kernel, when the input has no nulls, on the SSE instruction set.

However, it seems that this ad hoc approach may not scale very well. There are several widely-used SIMD instruction sets out there (the most common being SSE[2], AVX[2], AVX512, Neon... I suppose ARM SVE will come into play at some point), and there will be many potential functions to optimize once we start writing a comprehensive library of computation kernels. Adding hand-written implementations, using intrinsic functions, for each {routine, instruction set} pair threatens to create a large maintenance burden.

In that PR, I suggested that we instead take a look at the SIMD wrapper libraries available in C++. There are several available:

* MIPP (https://github.com/aff3ct/MIPP)
* Vc (https://github.com/VcDevel/Vc)
* libsimdpp (https://github.com/p12tic/libsimdpp)
* (others yet)

In the course of the discussion, an interesting paper was mentioned:
https://dl.acm.org/doi/pdf/10.1145/3178433.3178435
together with an implementation comparison of a simple function:
https://gitlab.inria.fr/acassagn/mandelbrot

The SIMD wrappers met skepticism from Frank, the PR submitter, on the basis that performance may not be optimal and that not all desired features may be provided (such as runtime dispatching). However, we also have to account that, without a wrapper library, we will probably only integrate and maintain a small fraction of the optimized routines that would be otherwise possible with a more abstracted approach. So, while the hand-written approach can be better on a single {routine, instruction set} pair, it may lead to a globally suboptimal situation (that is, unless the number of full-time developers and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as Micah, Frank, Yibo Cai) to hash out the various possible approaches, and propose a way forward (which may be hybrid).

Regards

Antoine.