Re: [C++] Replacing xsimd with compiler autovectorization

Antoine Pitrou Wed, 30 Mar 2022 23:38:04 -0700


On 31/03/2022 at 01:24, Weston Pace wrote:

Apologies if this is an oversimplified opinion, but does it have to be one
or the other?

If people contribute and maintain XSIMD kernels then great.

If people contribute and maintain auto-vectorizable kernels then great.

Then it just comes down to having consistent dispatch rules.  Something
like "use XSIMD kernels if it supports the arch otherwise fall back to
whatever else."

The plan to generate different auto vectorized versions sounds reasonable.
Maybe cmake can even add the flags by convention based on the filename.  If
it causes downstream cmake issues then we just add a flag to disable that
feature.
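
The dispatch rule suggested above ("use the most specific kernel the CPU
supports, otherwise fall back") could be sketched roughly as follows. This is
only an illustration under stated assumptions: SimdLevel, Candidate, Select
and the kernel names are hypothetical and are not the API of
arrow/util/dispatch.h, and CPU detection is replaced by passing the supported
level in as an argument.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction-set levels, ordered from least to most capable.
enum class SimdLevel { kNone = 0, kSse42 = 1, kAvx2 = 2, kAvx512 = 3 };

using AddFn = void (*)(const std::int32_t*, const std::int32_t*,
                       std::int32_t*, std::size_t);

// One registered implementation: the level it requires and the function.
struct Candidate {
  SimdLevel required;
  AddFn fn;
};

// Plain loop, left to the compiler's auto-vectorizer; usable on any CPU.
void AddFallback(const std::int32_t* a, const std::int32_t* b,
                 std::int32_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Stand-in for an explicitly vectorized (e.g. xsimd-based) AVX2 kernel.
// It reuses the scalar loop so that this sketch runs on any machine.
void AddExplicitAvx2(const std::int32_t* a, const std::int32_t* b,
                     std::int32_t* out, std::size_t n) {
  AddFallback(a, b, out, n);
}

// Pick the candidate with the highest required level that is still
// supported. Returns nullptr if no candidate is usable.
AddFn Select(const std::vector<Candidate>& candidates, SimdLevel supported) {
  AddFn best = nullptr;
  SimdLevel best_level = SimdLevel::kNone;
  for (const Candidate& c : candidates) {
    if (c.required <= supported &&
        (best == nullptr || c.required > best_level)) {
      best = c.fn;
      best_level = c.required;
    }
  }
  return best;
}
```

With candidates {kNone, AddFallback} and {kAvx2, AddExplicitAvx2}, a CPU
reporting only kSse42 would get the fallback, while one reporting kAvx2 or
higher would get the explicit kernel.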

As I showed, those auto-vectorized kernels may be vectorized only in
some situations, depending on the compiler version, the input
datatypes... If we go through the pain of building the scaffolding for
that, I'd rather see those kernels explicitly vectorized using xsimd.


Regards

Antoine.



On Wed, Mar 30, 2022, 11:18 AM Micah Kornfield wrote:


We were not going to use xsimd's dynamic dispatch, and instead roll our
own

There is already a dynamic dispatch facility, see: arrow/util/dispatch.h

Since we're rolling our own dynamic dispatch, we'd still have to compile
the same source file several times with different flags, so my proposal
doesn't change that.


If we are going to be compiling the same file over and over again, it would
be nice to have a naming convention for such files so they can be easily
distinguished.  We'd need to tackle this complexity at some point, but
trying to keep the mechanism understandable by people outside the project
is something that we should evaluate as this is implemented.

-Micah



On Wed, Mar 30, 2022 at 1:19 PM Sasha Krassovsky wrote:

Looking at the disassembly, the int16 and int32 versions are unrolled
and vectorized by clang 12.0, the int8 and int64 are not...

I think a big part of this is how fragile the current kernel system
implementation is. It seems to rely on templating lots of different parts
of kernels and hoping the compiler will glue them together via inlining,
and then vectorize.
GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able
to reproduce the vectorization you noticed with Clang (I'm actually really
impressed).
But still, as you said, it's not ideal to have one compiler be much faster
than another. The difference is possibly due to the ordering of
optimization passes. If a compiler does all of its inlining before
vectorization, then we can expect good output. If vectorization happens
between two inlining phases, then the compiler cannot vectorize (even if
something gets inlined later). My personal rule of thumb to handle this is
to write code that would be optimized well within only a few optimization
passes (e.g. rely on an inlining, a constant propagation, and a
vectorization pass) instead of relying on the correct distribution of a
large number of passes.

How do we ensure that common compilers optimize this routine as expected?

I think the main thing we can do is to write simple code that's obvious for
the compiler to vectorize ("meet the compiler halfway" in some sense).
A simple for loop operating on pointers is vectorized very well even by
ancient compilers (below I tested clang 3.9), so I think for a lot of these
kernels we'd be much more confident of it being consistently vectorized.

https://godbolt.org/z/e36M948jv
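
For readers who cannot open the link, a "simple for loop operating on
pointers" might look like the sketch below. This is an assumed illustration,
not necessarily the code behind the godbolt link, and the name VectorAdd is
borrowed from the original proposal. Whether it is actually vectorized still
depends on the compiler and flags (elsewhere in this thread: -O2 for clang,
-O3 or -O2 -ftree-vectorize for gcc), so the disassembly needs checking.

```cpp
#include <cstddef>
#include <cstdint>

// Element-wise addition written as a plain loop over raw pointers.
// The loop body contains no functors or callbacks, so the optimizer does
// not have to inline anything before it can consider vectorizing.
template <typename T>
void VectorAdd(const T* a, const T* b, T* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // The cast undoes integer promotion for the narrow types (int8, int16).
    out[i] = static_cast<T>(a[i] + b[i]);
  }
}

// Explicit instantiations, so that each element width shows up as its own
// function in the disassembly and can be inspected separately.
template void VectorAdd<std::int8_t>(const std::int8_t*, const std::int8_t*,
                                     std::int8_t*, std::size_t);
template void VectorAdd<std::int16_t>(const std::int16_t*, const std::int16_t*,
                                      std::int16_t*, std::size_t);
template void VectorAdd<std::int32_t>(const std::int32_t*, const std::int32_t*,
                                      std::int32_t*, std::size_t);
template void VectorAdd<std::int64_t>(const std::int64_t*, const std::int64_t*,
                                      std::int64_t*, std::size_t);
```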

therefore using xsimd does not prevent the compiler from optimizing more.

My main hesitation here is the same as above with the kernels system: the
more inlining that has to happen, the less likely it is for the compiler to
optimize as efficiently. However, I do agree this is probably not a big
issue with xsimd as it seems pretty lightweight.

we are very responsive to any request and we can quickly implement features

This is great to hear, and definitely makes me feel a lot better about
keeping and using xsimd where necessary.

Note however that it requires some basic checking of the output across
different compilers and different compiler versions.

Definitely agree here too. I would be in favor of some middle ground of
using autovectorization for simple stuff (like the Add kernel above), and
using xsimd for stuff that doesn't get vectorized well by some compilers.

So it seems to me that before my email, the current status was:
- We were going to simd-optimize our kernels using xsimd, but hadn't gotten
to it yet. This would've involved changing the kernels system to use xsimd
inside of the operation functors?
- We were not going to use xsimd's dynamic dispatch, and instead roll our
own

If that's the case then my proposal is relatively minor. Since we're
rolling our own dynamic dispatch, we'd still have to compile the same
source file several times with different flags, so my proposal doesn't
change that.
The part that my proposal changes is to refactor the code to be more
autovectorization-friendly. This seems to be a minor difference from what
was planned before, where instead of having the operation functor process
one simd-batch at a time, we have it process the whole batch (to minimize
the reliance on inlining). And of course we'd have to actually implement
the dynamic dispatch (I don't think I see it for most of the kernels
system).
Does something like this sound reasonable to people?
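
The difference between the two functor shapes can be sketched as below. The
names (AddOneUnit, ApplyElementwise, AddWholeBatch) are hypothetical and do
not come from Arrow's kernel system, and a single element stands in for "one
simd-batch" to keep the sketch free of dependencies.

```cpp
#include <cstddef>
#include <cstdint>

// Shape A (hypothetical): the functor handles one unit of work (here a
// single element, standing in for one SIMD batch) and a generic driver owns
// the loop. The loop can only be vectorized once Call() has been inlined.
struct AddOneUnit {
  template <typename T>
  static T Call(T left, T right) {
    return static_cast<T>(left + right);
  }
};

template <typename Op, typename T>
void ApplyElementwise(const T* left, const T* right, T* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = Op::Call(left[i], right[i]);
  }
}

// Shape B (hypothetical): the functor owns the loop over the whole batch,
// so the vectorizer sees the complete loop without relying on inlining.
struct AddWholeBatch {
  template <typename T>
  static void Call(const T* left, const T* right, T* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = static_cast<T>(left[i] + right[i]);
    }
  }
};
```

Both shapes compute the same result; the only intended difference is how
much the optimizer has to do before the loop becomes visible to it.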

Sasha

On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille wrote:

Hi all,

xsimd core developer here writing on behalf of the xsimd core team ;)

I just wanted to add some elements to this thread:

- xsimd is more than a library that wraps simple intrinsics. It provides
vectorized (and accurate) implementations of the traditional mathematical
functions (exp, sin, cos, etc.) that a compiler may not be able to
vectorize. And when the compiler does vectorize them, it usually relies on
external libraries that provide such implementations. Therefore, you
cannot guarantee that every compiler will optimize these calls, nor the
same accuracy across different platforms.

- compilers actually understand intrinsics and can optimize them; it's not
a "one intrinsic -> one asm instruction" mapping at all. Therefore, using
xsimd does not prevent the compiler from optimizing more.

- xsimd is actively developed and maintained. It is used as a building
block of the xtensor stack, and in Pythran which has been integrated in
scipy. Some features may be missing because of a lack of time and/or higher
priorities. However, Antoine can confirm that we are very responsive to any
request and we can quickly implement features that would be mandatory for
Apache Arrow.

- It is 100% true that for simple loops with simple arithmetic operations,
it is easier not to write xsimd code and let the compiler optimize the
loop. Note however that it requires some basic checking of the output
across different compilers and different compiler versions. See for
instance https://godbolt.org/z/KTcTe1zPn. Different versions of gcc
generate different vectorized code, and clang and gcc do not auto-vectorize
at the same optimization level (O2 for clang and O3 or O2 -ftree-vectorize
for gcc).

Regards,

Johan

On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou wrote:


Hi Sasha,

On 30/03/2022 at 00:14, Sasha Krassovsky wrote:

I've noticed that we include xsimd as an abstraction over all of the simd
architectures. I'd like to propose a different solution which would result
in fewer lines of code, while being more readable.

My thinking is that anything simple enough to abstract with xsimd can be
autovectorized by the compiler. Any more interesting SIMD algorithm usually
is tailored to the target instruction set and can't be abstracted away with
xsimd anyway.


As a matter of fact, we already rely on auto-vectorization in a couple
of places (mostly `aggregate_basic_avx2.cc` and friends).

The main concern with doing this is that auto-vectorization makes
performance difficult to predict and ensure. Some compilers or compiler
versions may fail to vectorize a given piece of code (how does MSVC fare
these days?). As a consequence, some Arrow builds may be 2x to 8x faster
than others on the same machine and the same workload. This is not an
optimal user experience, and in many cases it can be difficult or
impossible to change the compiler version.


As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we
already get some auto-vectorization and we can look at the concrete
numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise
addition kernel (this is with clang 12.0):
https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5

Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput
on that CPU (according to
https://www.agner.org/optimize/instruction_tables.pdf), all these data
types should show the same performance in bytes/second.  Yet int64 gets
half of int16 and int32, and int8 is far behind.
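
The arithmetic behind that expectation can be spelled out in a few lines.
The sketch below assumes a fully vectorized loop using 128-bit SSE registers
(the helper names are made up for illustration): each packed add covers 16
bytes whatever the element width, so equal instruction throughput implies
equal bytes/second.

```cpp
#include <cstddef>
#include <cstdint>

// Width in bytes of a 128-bit SSE (XMM) register, the operand size of the
// SSE forms of PADDB, PADDW, PADDD and PADDQ.
constexpr std::size_t kSseRegisterBytes = 16;

// Number of elements of type T handled by one 128-bit packed add.
template <typename T>
constexpr std::size_t LanesPerAdd() {
  return kSseRegisterBytes / sizeof(T);
}

// Bytes handled by one packed add: lanes times element size.
template <typename T>
constexpr std::size_t BytesPerAdd() {
  return LanesPerAdd<T>() * sizeof(T);
}

static_assert(LanesPerAdd<std::int8_t>() == 16, "PADDB: 16 lanes");
static_assert(LanesPerAdd<std::int16_t>() == 8, "PADDW: 8 lanes");
static_assert(LanesPerAdd<std::int32_t>() == 4, "PADDD: 4 lanes");
static_assert(LanesPerAdd<std::int64_t>() == 2, "PADDQ: 2 lanes");

// Fewer lanes but wider elements: the byte count per instruction is equal.
static_assert(BytesPerAdd<std::int8_t>() == BytesPerAdd<std::int64_t>(),
              "same bytes per packed add for all element widths");
```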

Looking at the disassembly, the int16 and int32 versions are unrolled
and vectorized by clang 12.0, the int8 and int64 are not... Why? How do
we ensure that common compilers optimize this routine as expected?
Personally, I have no idea.


As for why xsimd rather than hand-written intrinsics, it's a middle
ground in terms of maintainability and readability. Code using
intrinsics quickly gets hard to follow except for the daily SIMD
specialist, and it's also annoying to port to another architecture (if
only due to gratuitous naming differences).

As for why xsimd is not very much used across the codebase, the initial
intent was to have more SIMD-accelerated code, but nobody actually got
around to doing it, due to other priorities. In the end, we spend more time
adding features than hand-optimizing the existing code.

Regards

Antoine.


With that in mind, I'd like to propose the following strategy:
1. Write a single source file with simple, element-at-a-time for loop
implementations of each function.
2. Compile this same source file several times with different compile flags
for different vectorization (e.g. if we're on an x86 machine that supports
AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
3. Functions compiled with different instruction sets can be differentiated
by a namespace, which gets defined during the compiler invocation. For
example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2 and then
for something like elementwise addition of two arrays, we'd call
arrow::AVX2::VectorAdd.

I believe this would let us remove xsimd as a dependency while also giving
us lots of vectorized kernels at the cost of some extra cmake magic. After
that, it would just be a matter of making the function registry point to
these new functions.
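
A minimal sketch of what such a source file might look like, assuming the
NAMESPACE macro from step 3. The file name, the Baseline default and the
exact command lines in the comment are illustrative assumptions rather than
an existing Arrow build setup.

```cpp
// vector_add.cc (hypothetical file name), compiled once per target, e.g.:
//   c++ -O3 -mavx2     -DNAMESPACE=AVX2   -c vector_add.cc -o vector_add_avx2.o
//   c++ -O3 -mavx512vl -DNAMESPACE=AVX512 -c vector_add.cc -o vector_add_avx512.o
#include <cstddef>
#include <cstdint>

// Assumed default so that the file still builds without -DNAMESPACE=...
#ifndef NAMESPACE
#define NAMESPACE Baseline
#endif

namespace arrow {
namespace NAMESPACE {

// Simple element-at-a-time loop. The instruction set used for it is decided
// purely by the flags of the compiler invocation, not by the source code.
void VectorAdd(const std::int32_t* a, const std::int32_t* b,
               std::int32_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

}  // namespace NAMESPACE
}  // namespace arrow
```

Each object file then exports a distinct symbol (arrow::AVX2::VectorAdd,
arrow::AVX512::VectorAdd, ...), and the caller would still need matching
declarations plus a runtime check to decide which one to register.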

Please let me know your thoughts!

Thanks,
Sasha Krassovsky
