> > As for a naming convention, we could use something like the prefix `simd_`?
Bikeshedding moment: we use suffixes today for instruction sets, so would it make sense to continue that for consistency? `scalar_arithmetic_simd.cc`?

On Wed, Mar 30, 2022 at 4:58 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:

> Yep, that's basically what I'm suggesting. If someone contributes an xsimd kernel that's faster than the autovectorized kernel, then it'll be seamless to switch. The xsimd and autovectorized kernels would share the same source file, so anyone contributing an xsimd kernel would just have to change that one function.
>
> As for a naming convention, we could use something like the prefix `simd_`? So for example `scalar_arithmetic.cc` would have a corresponding file `simd_scalar_arithmetic.cc`, which would be compiled for each SIMD instruction set that we're targeting. `scalar_arithmetic.cc` would then be responsible for generating the dispatch rules.
>
> Sasha
>
> On Wed, Mar 30, 2022 at 4:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > Apologies if this is an oversimplified opinion, but does it have to be one or the other?
> >
> > If people contribute and maintain xsimd kernels, then great. If people contribute and maintain auto-vectorizable kernels, then great. Then it just comes down to having consistent dispatch rules, something like "use the xsimd kernel if it supports the arch, otherwise fall back to whatever else."
> >
> > The plan to generate different auto-vectorized versions sounds reasonable. Maybe CMake can even add the flags by convention based on the filename. If it causes downstream CMake issues then we just add a flag to disable that feature.
> >
> > On Wed, Mar 30, 2022, 11:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > > We were not going to use xsimd's dynamic dispatch, and instead roll our own
> > >
> > > There is already a dynamic dispatch facility, see arrow/util/dispatch.h.
> > >
> > > > Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that.
> > >
> > > If we are going to be compiling the same file over and over again, it would be nice to have a naming convention for such files so they can be easily distinguished. We'd need to tackle this complexity at some point, but trying to keep the mechanism understandable by people outside the project is something that we should evaluate as this is implemented.
> > >
> > > -Micah
> > >
> > > On Wed, Mar 30, 2022 at 1:19 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > >
> > > > > Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not...
> > > >
> > > > I think a big part of this is how fragile the current kernel system implementation is. It seems to rely on templating lots of different parts of kernels and hoping the compiler will glue them together via inlining, and then vectorize. GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able to reproduce the vectorization you noticed with Clang (I'm actually really impressed). But still, as you said, it's not ideal to have one compiler be much faster than another. The reason for the difference is possibly due to the ordering of optimization passes. If a compiler does all of its inlining before vectorization, then we can expect good output. If vectorization happens between two inlining phases, then the compiler cannot vectorize (even if something gets inlined later). My personal rule of thumb to handle this is to write code that would be optimized well within only a few optimization passes (e.g. rely on an inlining pass, a constant propagation pass, and a vectorization pass) instead of relying on the correct distribution of a large number of passes.
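For readers outside the kernels code, a minimal, simplified sketch of the kind of layering being described. The names are invented for the example and are not the actual Arrow kernel classes; the point is only that the loop becomes vectorizable only after the compiler has inlined through every layer.

    #include <cstdint>

    // Per-element operation: one template layer.
    struct Add {
      template <typename T>
      static T Call(T a, T b) { return a + b; }
    };

    // Generic "executor": a second layer that visits one element at a time
    // through a callable.
    template <typename T, typename Visitor>
    void VisitPairs(const T* a, const T* b, int64_t n, Visitor&& visit) {
      for (int64_t i = 0; i < n; ++i) visit(a[i], b[i], i);
    }

    // The kernel only turns into a flat, vectorizable loop once VisitPairs,
    // the lambda, and Add::Call have all been inlined; if the vectorizer runs
    // before that inlining has finished, the loop body still contains calls
    // and the loop stays scalar.
    template <typename Op, typename T>
    void BinaryKernel(const T* a, const T* b, T* out, int64_t n) {
      VisitPairs(a, b, n, [&](T x, T y, int64_t i) { out[i] = Op::Call(x, y); });
    }

    // Explicit instantiation so the sketch is self-contained.
    template void BinaryKernel<Add, int32_t>(const int32_t*, const int32_t*, int32_t*, int64_t);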
> > > > > How do we ensure that common compilers optimize this routine as expected?
> > > >
> > > > I think the main thing we can do is to write simple code that's obvious for the compiler to vectorize ("meet the compiler halfway" in some sense). A simple for loop operating on pointers is vectorized very well even by ancient compilers (below I tested clang 3.9), so I think for a lot of these kernels we'd be much more confident of them being consistently vectorized: https://godbolt.org/z/e36M948jv
> > > >
> > > > > Therefore, using xsimd does not prevent the compiler from optimizing more.
> > > >
> > > > My main hesitation here is the same as above with the kernels system: the more inlining that has to happen, the less likely it is for the compiler to optimize as efficiently. However, I do agree this is probably not a big issue with xsimd, as it seems pretty lightweight.
> > > >
> > > > > We are very responsive to any request and we can quickly implement features.
> > > >
> > > > This is great to hear, and definitely makes me feel a lot better about keeping and using xsimd where necessary.
> > > >
> > > > > Note however that it requires some basic checking of the output across different compilers and different compiler versions.
> > > >
> > > > Definitely agree here too. I would be in favor of some middle ground of using autovectorization for simple stuff (like the Add kernel above) and using xsimd for stuff that doesn't get vectorized well by some compilers.
> > > >
> > > > So it seems to me that before my email, the current status was:
> > > > - We were going to simd-optimize our kernels using xsimd, but hadn't gotten to it yet. This would've involved changing the kernels system to use xsimd inside of the operation functors?
> > > > - We were not going to use xsimd's dynamic dispatch, and instead roll our own.
> > > >
> > > > If that's the case then my proposal is relatively minor. Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that. The part that my proposal changes is to refactor the code to be more autovectorization-friendly. This seems to be a minor difference from what was planned before: instead of having the operation functor process one SIMD batch at a time, we have it process the whole batch (to minimize the reliance on inlining). And of course we'd have to actually implement the dynamic dispatch (I don't think I see it for most of the kernels system). Does something like this sound reasonable to people?
> > > >
> > > > Sasha
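A minimal sketch of the "whole batch at once, plain pointers" shape being argued for above. The names are invented for the example and this is not existing Arrow code:

    #include <cstdint>

    // One call processes the entire batch. There is nothing the compiler has
    // to inline first, so a single vectorization pass over this loop is
    // enough, which is the "meet the compiler halfway" idea behind the
    // godbolt link above.
    void AddInt32(const int32_t* left, const int32_t* right, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = left[i] + right[i];
      }
    }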
> > > > On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille <johan.mabi...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > xsimd core developer here, writing on behalf of the xsimd core team ;)
> > > > >
> > > > > I just wanted to add some elements to this thread:
> > > > >
> > > > > - xsimd is more than a library that wraps simple intrinsics. It provides vectorized (and accurate) implementations of the traditional mathematical functions (exp, sin, cos, etc.) that a compiler may not be able to vectorize. And when the compiler does vectorize them, it usually relies on external libraries that provide such implementations. Therefore, you cannot guarantee that every compiler will optimize these calls, nor the same accuracy across different platforms.
> > > > >
> > > > > - Compilers actually understand intrinsics and can optimize them; it's not a "one intrinsic -> one asm instruction" mapping at all. Therefore, using xsimd does not prevent the compiler from optimizing more.
> > > > >
> > > > > - xsimd is actively developed and maintained. It is used as a building block of the xtensor stack, and in Pythran, which has been integrated in SciPy. Some features may be missing because of a lack of time and/or higher priorities. However, Antoine can confirm that we are very responsive to any request and we can quickly implement features that would be mandatory for Apache Arrow.
> > > > >
> > > > > - It is 100% true that for simple loops with simple arithmetic operations, it is easier not to write xsimd code and let the compiler optimize the loop. Note however that it requires some basic checking of the output across different compilers and different compiler versions. See for instance https://godbolt.org/z/KTcTe1zPn: different versions of gcc generate different vectorized code, and clang and gcc do not auto-vectorize at the same optimization level (O2 for clang, and O3 or O2 -ftree-vectorize for gcc).
> > > > >
> > > > > Regards,
> > > > >
> > > > > Johan
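To make the first of those points concrete, a rough sketch of a kernel using xsimd's vectorized math functions. This is written against the xsimd 8.x batch API from memory, so exact spellings may differ between versions, and the function name is illustrative:

    #include <cmath>
    #include <cstddef>
    #include <xsimd/xsimd.hpp>

    // Vectorized exp over a buffer; the scalar tail handles the elements that
    // do not fill a complete SIMD register.
    void ExpFloat(const float* in, float* out, std::size_t n) {
      using batch = xsimd::batch<float>;            // register-sized pack for the default arch
      constexpr std::size_t lanes = batch::size;
      std::size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        batch x = batch::load_unaligned(in + i);    // load one register's worth
        xsimd::exp(x).store_unaligned(out + i);     // vectorized exp, store back
      }
      for (; i < n; ++i) out[i] = std::exp(in[i]);  // scalar remainder
    }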
> > > > > On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > >
> > > > > > Hi Sasha,
> > > > > >
> > > > > > On 30/03/2022 at 00:14, Sasha Krassovsky wrote:
> > > > > > > I've noticed that we include xsimd as an abstraction over all of the SIMD architectures. I'd like to propose a different solution which would result in fewer lines of code, while being more readable.
> > > > > > >
> > > > > > > My thinking is that anything simple enough to abstract with xsimd can be autovectorized by the compiler. Any more interesting SIMD algorithm is usually tailored to the target instruction set and can't be abstracted away with xsimd anyway.
> > > > > >
> > > > > > As a matter of fact, we already rely on auto-vectorization in a couple of places (mostly `aggregate_basic_avx2.cc` and friends).
> > > > > >
> > > > > > The main concern with doing this is that auto-vectorization makes performance difficult to predict and ensure. Some compilers or compiler versions may fail to vectorize a given piece of code (how does MSVC fare these days?). As a consequence, some Arrow builds may be 2x to 8x faster than others on the same machine and the same workload. This is not an optimal user experience, and in many cases it can be difficult or impossible to change the compiler version.
> > > > > >
> > > > > > As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we already get some auto-vectorization and we can look at the concrete numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise addition kernel (this is with clang 12.0): https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5
> > > > > >
> > > > > > Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput on that CPU (according to https://www.agner.org/optimize/instruction_tables.pdf), all these data types should show the same performance in bytes/second. Yet int64 gets half of int16 and int32, and int8 is far behind.
> > > > > >
> > > > > > Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not... Why? How do we ensure that common compilers optimize this routine as expected? Personally, I have no idea.
> > > > > >
> > > > > > As for why xsimd rather than hand-written intrinsics, it's a middle ground in terms of maintainability and readability. Code using intrinsics quickly gets hard to follow except for the daily SIMD specialist, and it's also annoying to port to another architecture (even due to gratuitous naming differences).
> > > > > >
> > > > > > As for why xsimd is not used very much across the codebase, the initial intent was to have more SIMD-accelerated code, but nobody actually got around to doing it, due to other priorities. In the end, we spend more time adding features than hand-optimizing the existing code.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > > > With that in mind, I'd like to propose the following strategy:
> > > > > > > 1. Write a single source file with simple, element-at-a-time for loop implementations of each function.
> > > > > > > 2. Compile this same source file several times with different compile flags for different vectorization (e.g. if we're on an x86 machine that supports AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
> > > > > > > 3. Functions compiled with different instruction sets can be differentiated by a namespace, which gets defined during the compiler invocation. For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2, and then for something like elementwise addition of two arrays, we'd call arrow::AVX2::VectorAdd.
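A sketch of what such a per-instruction-set source file could look like under this three-step proposal, borrowing the `simd_` prefix idea from the naming discussion at the top of the thread; the file and function names are illustrative, not existing Arrow code:

    // simd_scalar_arithmetic.cc (illustrative name). The build would compile
    // this one file once per target instruction set, e.g. with
    // "-mavx2 -DNAMESPACE=AVX2" and again with "-mavx512vl -DNAMESPACE=AVX512",
    // yielding arrow::AVX2::VectorAdd, arrow::AVX512::VectorAdd, and so on
    // from the same plain loop.
    #include <cstdint>

    namespace arrow {
    namespace NAMESPACE {

    void VectorAdd(const int32_t* left, const int32_t* right, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = left[i] + right[i];
      }
    }

    }  // namespace NAMESPACE
    }  // namespace arrow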
> > > > > > > I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels at the cost of some extra CMake magic. After that, it would just be a matter of making the function registry point to these new functions.
> > > > > > >
> > > > > > > Please let me know your thoughts!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sasha Krassovsky
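Finally, a sketch of the dispatch side that `scalar_arithmetic.cc` (or the existing function registry) could own, following the "use the fastest supported instruction set, otherwise fall back" rule discussed in the thread. The per-ISA declarations mirror the file sketched earlier, and the CPU-feature checks are placeholders rather than actual Arrow APIs:

    // scalar_arithmetic.cc (illustrative). Declarations of the per-ISA symbols
    // produced by compiling the simd_ source file with different flags.
    #include <cstdint>

    namespace arrow {

    namespace AVX512 { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
    namespace AVX2   { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
    namespace SSE42  { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }

    using VectorAddFn = void (*)(const int32_t*, const int32_t*, int32_t*, int64_t);

    // Placeholder runtime CPU-feature checks (e.g. cpuid-based, or whatever
    // facility the project settles on); declared here only so the sketch is
    // self-contained.
    bool CpuSupportsAvx512();
    bool CpuSupportsAvx2();

    // Resolved once at startup and registered as the "add" kernel implementation.
    VectorAddFn ResolveVectorAdd() {
      if (CpuSupportsAvx512()) return AVX512::VectorAdd;
      if (CpuSupportsAvx2()) return AVX2::VectorAdd;
      return SSE42::VectorAdd;  // SSE4.2 is the x86 baseline mentioned above
    }

    }  // namespace arrow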