I did a quick investigation of libsimdpp, Google Highway [1], and Vc (std-simd) [2].
I tried rewriting the SIMD UTF-8 validation code by unifying the SSE and NEON intrinsics with libsimdpp wrappers; you can compare the simdpp/sse/neon versions [3]. UTF-8 validation is non-trivial, but the porting was straightforward, easier than I expected. The wrappers also replace cryptic names like `_mm_shuffle_epi8` and `vqtbl1q_u8` with friendlier ones such as `simdpp::permute_bytes16` or `hwy::TableLookupBytes`.

In the end, though, I was blocked by a gap between NEON and SSE4. NEON `tbl` can look up multiple concatenated tables and returns 0 for out-of-range indices [4]. SSE4 `pshufb` looks up a single table, and its handling of out-of-range indices is more convoluted: only an index with the high bit set yields 0, otherwise the low four bits are used [5]. (A small scalar model of the two semantics is sketched below.) In this specific code the NEON behaviour is convenient; to unify the code I would have to give up that NEON feature and sacrifice performance on ARM. It would of course also be possible to improve libsimdpp/Highway, but I think the issue is common to all SIMD wrappers: they can unify most code when the vector length is the same, yet there are always cases where architecture-dependent code is necessary, such as the example above or advanced features like AVX-512 masks, which wrappers do not support well.

About performance: as described in Google Highway's design philosophy [6], it achieves portability, maintainability and readability by sacrificing 10-20% performance. That sounds fair. libsimdpp and Highway look like mature products; both claim to support gcc/clang/msvc, C++11, and x86/ARM/PPC. std-simd only supports gcc 9+, and its ARM/PPC support is currently poor.

That said, I don't think adopting a SIMD wrapper alone will fix our problem, especially code size, both source and binary: they are essentially thin wrappers over intrinsics with a friendlier API. Personally, I would rather apply SIMD only to subroutines (e.g. the sum loop), not the whole kernel. It is simpler, though we need to keep the ifdefs from exploding. A SIMD kernel shares a lot of code with the base kernel, and moving that shared code into an xxx_internal.h header makes the base kernel harder to read. Besides, as Wes commented [7], it is better to put the SIMD code in a standalone shared library. Binary size may still grow quickly: a simple sum operation already needs #{i8,i16,i32,i64,float,double} * #{sse,avx,avx512 | neon,sve} code instances, although a wrapper's carefully designed templates may help reduce the source size. (A rough sketch of that layout is also included below.)
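To make the tbl/pshufb gap concrete, here is a minimal scalar model of the two lookup semantics. This is plain C++ written for illustration, not the actual ported code; it only models the single-table NEON form (the multi-table variants just extend the valid index range).

#include <array>
#include <cstdint>
#include <cstdio>

using Vec16 = std::array<uint8_t, 16>;

// NEON vqtbl1q_u8 semantics: any index >= 16 yields 0.
Vec16 NeonTblModel(const Vec16& table, const Vec16& idx) {
  Vec16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
  return out;
}

// SSE _mm_shuffle_epi8 (pshufb) semantics: only an index byte with the
// high bit set yields 0; otherwise the low 4 bits select, so 16..127 wrap.
Vec16 SsePshufbModel(const Vec16& table, const Vec16& idx) {
  Vec16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
  return out;
}

int main() {
  Vec16 table{};
  for (int i = 0; i < 16; ++i) table[i] = static_cast<uint8_t>(0xA0 + i);
  Vec16 idx = {0, 15, 16, 31, 0x7F, 0x80, 0xFF};  // remaining lanes are zero
  Vec16 n = NeonTblModel(table, idx);
  Vec16 s = SsePshufbModel(table, idx);
  for (int i = 0; i < 7; ++i)
    std::printf("idx=0x%02X  neon=0x%02X  sse=0x%02X\n",
                (unsigned)idx[i], (unsigned)n[i], (unsigned)s[i]);
  // e.g. idx=16: NEON returns 0 while SSE wraps to table[0]; a portable
  // wrapper has to assume the weaker contract (or mask the indices),
  // which is where the ARM-side convenience is lost.
  return 0;
}

And a very rough sketch of the "SIMD only in small subroutines, per-ISA objects, runtime dispatch" layout I have in mind. All names here (SumLoop, ResolveSumInt64, the per-ISA .cc files) are hypothetical, not existing Arrow code:

#include <cstddef>
#include <cstdint>

// One templated inner loop; a SIMD wrapper (libsimdpp/Highway) or the
// auto-vectorizer turns this body into vector code, so the source is
// written once per operation.
template <typename T>
T SumLoop(const T* values, std::size_t n) {
  T sum = T(0);
  for (std::size_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

// The binary still ends up with #{i8,i16,i32,i64,float,double} instances
// per ISA: each per-ISA translation unit (sum_sse.cc, sum_avx2.cc,
// sum_neon.cc, ...) explicitly instantiates the same template under
// different compiler flags.
template int64_t SumLoop<int64_t>(const int64_t*, std::size_t);
template double SumLoop<double>(const double*, std::size_t);
// ... and so on for the other element types.

// The base kernel only sees a function pointer chosen once at startup
// (cpuid / getauxval), so no #ifdef leaks into the kernel itself.
using SumInt64Fn = int64_t (*)(const int64_t*, std::size_t);
SumInt64Fn ResolveSumInt64();  // hypothetical per-ISA resolver, defined elsewhere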
[1] https://github.com/google/highway
[2] https://github.com/VcDevel/std-simd
[3] simdpp: https://github.com/cyb70289/utf8/blob/simdpp/range-simdpp.cc
    sse:    https://github.com/cyb70289/utf8/blob/simdpp/range-sse.c
    neon:   https://github.com/cyb70289/utf8/blob/simdpp/range-neon.c
[4] https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vqtbl2q_u8
[5] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_shuffle_epi8&expand=5153
[6] https://github.com/google/highway#design-philosophy
[7] https://github.com/apache/arrow/pull/7314#issuecomment-638972317

Yibo

On 6/9/20 5:34 PM, Antoine Pitrou wrote:

Hello,

As part of https://github.com/apache/arrow/pull/7314, a discussion started about our strategy for adding SIMD optimizations to various routines and kernels.

Currently, we have no defined strategy and we have been adding hand-written SIMD-optimized functions for particular primitives and instruction sets, thanks to the submissions of contributors. For example, the above PR adds ~500 lines of code for the purpose of accelerating the SUM kernel, when the input has no nulls, on the SSE instruction set.

However, it seems that this ad hoc approach may not scale very well. There are several widely-used SIMD instruction sets out there (the most common being SSE[2], AVX[2], AVX512, Neon... I suppose ARM SVE will come into play at some point), and there will be many potential functions to optimize once we start writing a comprehensive library of computation kernels. Adding hand-written implementations, using intrinsic functions, for each {routine, instruction set} pair threatens to create a large maintenance burden.

In that PR, I suggested that we instead take a look at the SIMD wrapper libraries available in C++. There are several available:

* MIPP (https://github.com/aff3ct/MIPP)
* Vc (https://github.com/VcDevel/Vc)
* libsimdpp (https://github.com/p12tic/libsimdpp)
* (others yet)

In the course of the discussion, an interesting paper was mentioned:
https://dl.acm.org/doi/pdf/10.1145/3178433.3178435
together with an implementation comparison of a simple function:
https://gitlab.inria.fr/acassagn/mandelbrot

The SIMD wrappers met skepticism from Frank, the PR submitter, on the basis that performance may not be optimal and that not all desired features may be provided (such as runtime dispatching). However, we also have to account that, without a wrapper library, we will probably only integrate and maintain a small fraction of the optimized routines that would be otherwise possible with a more abstracted approach. So, while the hand-written approach can be better on a single {routine, instruction set} pair, it may lead to a globally suboptimal situation (that is, unless the number of full-time developers and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as Micah, Frank, Yibo Cai) to hash out the various possible approaches, and propose a way forward (which may be hybrid).

Regards

Antoine.