Re: [C++] Replacing xsimd with compiler autovectorization

Antoine Pitrou Wed, 30 Mar 2022 01:10:03 -0700


Hi Sasha,

On 30/03/2022 at 00:14, Sasha Krassovsky wrote:

I've noticed that we include xsimd as an abstraction over all of the simd
architectures. I'd like to propose a different solution which would result
in fewer lines of code, while being more readable.

My thinking is that anything simple enough to abstract with xsimd can be
autovectorized by the compiler. Any more interesting SIMD algorithm usually
is tailored to the target instruction set and can't be abstracted away with
xsimd anyway.

As a matter of fact, we already rely on auto-vectorization in a couple of places (mostly `aggregate_basic_avx2.cc` and friends).

The main concern with doing this is that auto-vectorization makes performance difficult to predict and ensure. Some compilers or compiler versions may fail to vectorize a given piece of code (how does MSVC fare these days?). As a consequence, some Arrow builds may be 2x to 8x faster than others on the same machine and the same workload. This is not an optimal user experience, and in many cases it can be difficult or impossible to change the compiler version.

As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we already get some auto-vectorization and we can look at concrete numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise addition kernel (this is with clang 12.0):

https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5

Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput on that CPU (according to https://www.agner.org/optimize/instruction_tables.pdf), all these data types should show the same performance in bytes/second. Yet int64 gets half the throughput of int16 and int32, and int8 is far behind.

Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, while the int8 and int64 versions are not... Why? How do we ensure that common compilers optimize this routine as expected? Personally, I have no idea.
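
To make the discussion concrete, here is a minimal sketch of the kind of element-at-a-time loop in question. It is not the actual Arrow kernel; the name `PairwiseAdd` and its signature are assumptions made for illustration only. The explicit instantiations are there so that each integer width gets its own symbol to look for in the disassembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pairwise addition kernel (illustration only, not Arrow's code).
// Whether this loop gets vectorized is entirely up to the compiler.
template <typename T>
void PairwiseAdd(const T* left, const T* right, T* out, std::size_t length) {
  // Debug-only precondition: non-empty inputs must have valid pointers.
  assert(length == 0 || (left != nullptr && right != nullptr && out != nullptr));
  for (std::size_t i = 0; i < length; ++i) {
    // Narrow types are promoted to int for the addition, then converted back.
    out[i] = static_cast<T>(left[i] + right[i]);
  }
}

// One explicit instantiation per width, so each can be inspected separately.
template void PairwiseAdd<std::int8_t>(const std::int8_t*, const std::int8_t*,
                                       std::int8_t*, std::size_t);
template void PairwiseAdd<std::int16_t>(const std::int16_t*, const std::int16_t*,
                                        std::int16_t*, std::size_t);
template void PairwiseAdd<std::int32_t>(const std::int32_t*, const std::int32_t*,
                                        std::int32_t*, std::size_t);
template void PairwiseAdd<std::int64_t>(const std::int64_t*, const std::int64_t*,
                                        std::int64_t*, std::size_t);
```

With clang, compiling such a file with `-O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize` should report which loops were or were not vectorized; GCC has `-fopt-info-vec-missed` for a similar purpose. That tells you what one compiler version did, not what the others will do.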

As for why xsimd rather than hand-written intrinsics, it's a middle ground in terms of maintainability and readability. Code using intrinsics quickly gets hard to follow except for the daily SIMD specialist, and it's also annoying to port to another architecture (if only because of gratuitous naming differences).

As for why xsimd is not used much across the codebase, the initial intent was to have more SIMD-accelerated code, but nobody actually got around to doing it, due to other priorities. In the end, we spend more time adding features than hand-optimizing the existing code.


Regards

Antoine.


With that in mind, I'd like to propose the following strategy:
1. Write a single source file with simple, element-at-a-time for loop
implementations of each function.
2. Compile this same source file several times with different compile flags
for different vectorization (e.g. if we're on an x86 machine that supports
AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
3. Functions compiled with different instruction sets can be differentiated
by a namespace, which gets defined during the compiler invocation. For
example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2 and then
for something like elementwise addition of two arrays, we'd call
arrow::AVX2::VectorAdd.
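
The three steps above could be sketched roughly as follows. This is only an illustration of the idea under stated assumptions: the `Baseline` fallback, the `int32_t`-only signature and the file layout are hypothetical, and only the `NAMESPACE` macro and the `arrow::AVX2::VectorAdd` naming come from the proposal itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// NAMESPACE is expected to come from the build system, for example
// -DNAMESPACE=AVX2 together with -mavx2. The Baseline fallback is an
// assumption so that this file also compiles without any extra flags.
#ifndef NAMESPACE
#define NAMESPACE Baseline
#endif

namespace arrow {
namespace NAMESPACE {

// Simple element-at-a-time loop; vectorization is left to the compiler
// and depends on the instruction-set flags this translation unit gets.
void VectorAdd(const std::int32_t* a, const std::int32_t* b,
               std::int32_t* out, std::size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

}  // namespace NAMESPACE
}  // namespace arrow
```

Compiling this file once per target gives distinct symbols such as `arrow::Baseline::VectorAdd` and `arrow::AVX2::VectorAdd`. Code that picks between them at runtime would still need to be compiled with baseline flags only, and would need some way of detecting CPU features, which this sketch does not cover.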

I believe this would let us remove xsimd as a dependency while also giving
us lots of vectorized kernels at the cost of some extra cmake magic. After
that, it would just be a matter of making the function registry point to
these new functions.

Please let me know your thoughts!

Thanks,
Sasha Krassovsky
