> > As for a naming convention, we could use something like the prefix `simd_`?
Bikeshedding moment: we use suffixes today for instruction sets, so would it make sense to continue that for consistency? `scalar_arithmetic_simd.cc`?

On Wed, Mar 30, 2022 at 4:58 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:

> Yep, that's basically what I'm suggesting. If someone contributes an xsimd kernel that's faster than the autovectorized kernel, then it'll be seamless to switch. The xsimd and autovectorized kernels would share the same source file, so anyone contributing an xsimd kernel would just have to change that one function.
>
> As for a naming convention, we could use something like the prefix `simd_`? So for example `scalar_arithmetic.cc` would have a corresponding file `simd_scalar_arithmetic.cc`, which would be compiled for each SIMD instruction set that we're targeting. `scalar_arithmetic.cc` would then be responsible for generating the dispatch rules.
>
> Sasha
>
> On Wed, Mar 30, 2022 at 4:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > Apologies if this is an oversimplified opinion, but does it have to be one or the other?
> >
> > If people contribute and maintain xsimd kernels, then great. If people contribute and maintain auto-vectorizable kernels, then great. Then it just comes down to having consistent dispatch rules, something like "use the xsimd kernel if it supports the arch, otherwise fall back to whatever else."
> >
> > The plan to generate different auto-vectorized versions sounds reasonable. Maybe CMake can even add the flags by convention based on the filename. If it causes downstream CMake issues then we just add a flag to disable that feature.
> >
> > On Wed, Mar 30, 2022, 11:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > > We were not going to use xsimd's dynamic dispatch, and instead roll our own
> > >
> > > There is already a dynamic dispatch facility, see arrow/util/dispatch.h.
> > >
> > > > Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that.
> > >
> > > If we are going to be compiling the same file over and over again, it would be nice to have a naming convention for such files so they can be easily distinguished. We'd need to tackle this complexity at some point, but trying to keep the mechanism understandable by people outside the project is something that we should evaluate as this is implemented.
> > >
> > > -Micah
> > >
> > > On Wed, Mar 30, 2022 at 1:19 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > >
> > > > > Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not...
> > > >
> > > > I think a big part of this is how fragile the current kernel system implementation is. It seems to rely on templating lots of different parts of kernels and hoping the compiler will glue them together via inlining, and then vectorize. GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able to reproduce the vectorization you noticed with Clang (I'm actually really impressed). But still, as you said, it's not ideal to have one compiler be much faster than another. The reason for the difference is possibly due to the ordering of optimization passes. If a compiler does all of its inlining before vectorization, then we can expect good output. If vectorization happens between two inlining phases, then the compiler cannot vectorize (even if something gets inlined later). My personal rule of thumb to handle this is to write code that would be optimized well within only a few optimization passes (e.g. rely on an inlining pass, a constant propagation pass, and a vectorization pass) instead of relying on the correct distribution of a large number of passes.
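For readers outside the kernels code, a minimal, simplified sketch of the kind of layering being described. The names are invented for the example and are not the actual Arrow kernel classes; the point is only that the loop becomes vectorizable only after the compiler has inlined through every layer.

    #include <cstdint>

    // Per-element operation: one template layer.
    struct Add {
      template <typename T>
      static T Call(T a, T b) { return a + b; }
    };

    // Generic "executor": a second layer that visits one element at a time
    // through a callable.
    template <typename T, typename Visitor>
    void VisitPairs(const T* a, const T* b, int64_t n, Visitor&& visit) {
      for (int64_t i = 0; i < n; ++i) visit(a[i], b[i], i);
    }

    // The kernel only turns into a flat, vectorizable loop once VisitPairs,
    // the lambda, and Add::Call have all been inlined; if the vectorizer runs
    // before that inlining has finished, the loop body still contains calls
    // and the loop stays scalar.
    template <typename Op, typename T>
    void BinaryKernel(const T* a, const T* b, T* out, int64_t n) {
      VisitPairs(a, b, n, [&](T x, T y, int64_t i) { out[i] = Op::Call(x, y); });
    }

    // Explicit instantiation so the sketch is self-contained.
    template void BinaryKernel<Add, int32_t>(const int32_t*, const int32_t*, int32_t*, int64_t);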
> > > > > How do we ensure that common compilers optimize this routine as expected?
> > > >
> > > > I think the main thing we can do is to write simple code that's obvious for the compiler to vectorize ("meet the compiler halfway" in some sense). A simple for loop operating on pointers is vectorized very well even by ancient compilers (below I tested clang 3.9), so I think for a lot of these kernels we'd be much more confident of them being consistently vectorized: https://godbolt.org/z/e36M948jv
> > > >
> > > > > Therefore, using xsimd does not prevent the compiler from optimizing more.
> > > >
> > > > My main hesitation here is the same as above with the kernels system: the more inlining that has to happen, the less likely it is for the compiler to optimize as efficiently. However, I do agree this is probably not a big issue with xsimd, as it seems pretty lightweight.
> > > >
> > > > > We are very responsive to any request and we can quickly implement features.
> > > >
> > > > This is great to hear, and definitely makes me feel a lot better about keeping and using xsimd where necessary.
> > > >
> > > > > Note however that it requires some basic checking of the output across different compilers and different compiler versions.
> > > >
> > > > Definitely agree here too. I would be in favor of some middle ground of using autovectorization for simple stuff (like the Add kernel above) and using xsimd for stuff that doesn't get vectorized well by some compilers.
> > > >
> > > > So it seems to me that before my email, the current status was:
> > > > - We were going to simd-optimize our kernels using xsimd, but hadn't gotten to it yet. This would've involved changing the kernels system to use xsimd inside of the operation functors?
> > > > - We were not going to use xsimd's dynamic dispatch, and instead roll our own.
> > > >
> > > > If that's the case then my proposal is relatively minor. Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that. The part that my proposal changes is to refactor the code to be more autovectorization-friendly. This seems to be a minor difference from what was planned before: instead of having the operation functor process one SIMD batch at a time, we have it process the whole batch (to minimize the reliance on inlining). And of course we'd have to actually implement the dynamic dispatch (I don't think I see it for most of the kernels system). Does something like this sound reasonable to people?
> > > >
> > > > Sasha
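A minimal sketch of the "whole batch at once, plain pointers" shape being argued for above. The names are invented for the example and this is not existing Arrow code:

    #include <cstdint>

    // One call processes the entire batch. There is nothing the compiler has
    // to inline first, so a single vectorization pass over this loop is
    // enough, which is the "meet the compiler halfway" idea behind the
    // godbolt link above.
    void AddInt32(const int32_t* left, const int32_t* right, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = left[i] + right[i];
      }
    }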
> > > > On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille <johan.mabi...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > xsimd core developer here, writing on behalf of the xsimd core team ;)
> > > > >
> > > > > I just wanted to add some elements to this thread:
> > > > >
> > > > > - xsimd is more than a library that wraps simple intrinsics. It provides vectorized (and accurate) implementations of the traditional mathematical functions (exp, sin, cos, etc.) that a compiler may not be able to vectorize. And when the compiler does vectorize them, it usually relies on external libraries that provide such implementations. Therefore, you cannot guarantee that every compiler will optimize these calls, nor the same accuracy across different platforms.
> > > > >
> > > > > - Compilers actually understand intrinsics and can optimize them; it's not a "one intrinsic -> one asm instruction" mapping at all. Therefore, using xsimd does not prevent the compiler from optimizing more.
> > > > >
> > > > > - xsimd is actively developed and maintained. It is used as a building block of the xtensor stack, and in Pythran, which has been integrated in SciPy. Some features may be missing because of a lack of time and/or higher priorities. However, Antoine can confirm that we are very responsive to any request and we can quickly implement features that would be mandatory for Apache Arrow.
> > > > >
> > > > > - It is 100% true that for simple loops with simple arithmetic operations, it is easier not to write xsimd code and let the compiler optimize the loop. Note however that it requires some basic checking of the output across different compilers and different compiler versions. See for instance https://godbolt.org/z/KTcTe1zPn: different versions of gcc generate different vectorized code, and clang and gcc do not auto-vectorize at the same optimization level (O2 for clang, and O3 or O2 -ftree-vectorize for gcc).
> > > > >
> > > > > Regards,
> > > > >
> > > > > Johan
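To make the first of those points concrete, a rough sketch of a kernel using xsimd's vectorized math functions. This is written against the xsimd 8.x batch API from memory, so exact spellings may differ between versions, and the function name is illustrative:

    #include <cmath>
    #include <cstddef>
    #include <xsimd/xsimd.hpp>

    // Vectorized exp over a buffer; the scalar tail handles the elements that
    // do not fill a complete SIMD register.
    void ExpFloat(const float* in, float* out, std::size_t n) {
      using batch = xsimd::batch<float>;            // register-sized pack for the default arch
      constexpr std::size_t lanes = batch::size;
      std::size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        batch x = batch::load_unaligned(in + i);    // load one register's worth
        xsimd::exp(x).store_unaligned(out + i);     // vectorized exp, store back
      }
      for (; i < n; ++i) out[i] = std::exp(in[i]);  // scalar remainder
    }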
> > > > > On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > >
> > > > > > Hi Sasha,
> > > > > >
> > > > > > On 30/03/2022 at 00:14, Sasha Krassovsky wrote:
> > > > > > > I've noticed that we include xsimd as an abstraction over all of the SIMD architectures. I'd like to propose a different solution which would result in fewer lines of code, while being more readable.
> > > > > > >
> > > > > > > My thinking is that anything simple enough to abstract with xsimd can be autovectorized by the compiler. Any more interesting SIMD algorithm is usually tailored to the target instruction set and can't be abstracted away with xsimd anyway.
> > > > > >
> > > > > > As a matter of fact, we already rely on auto-vectorization in a couple of places (mostly `aggregate_basic_avx2.cc` and friends).
> > > > > >
> > > > > > The main concern with doing this is that auto-vectorization makes performance difficult to predict and ensure. Some compilers or compiler versions may fail to vectorize a given piece of code (how does MSVC fare these days?). As a consequence, some Arrow builds may be 2x to 8x faster than others on the same machine and the same workload. This is not an optimal user experience, and in many cases it can be difficult or impossible to change the compiler version.
> > > > > >
> > > > > > As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we already get some auto-vectorization and we can look at the concrete numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise addition kernel (this is with clang 12.0): https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5
> > > > > >
> > > > > > Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput on that CPU (according to https://www.agner.org/optimize/instruction_tables.pdf), all these data types should show the same performance in bytes/second. Yet int64 gets half of int16 and int32, and int8 is far behind.
> > > > > >
> > > > > > Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not... Why? How do we ensure that common compilers optimize this routine as expected? Personally, I have no idea.
> > > > > >
> > > > > > As for why xsimd rather than hand-written intrinsics, it's a middle ground in terms of maintainability and readability. Code using intrinsics quickly gets hard to follow except for the daily SIMD specialist, and it's also annoying to port to another architecture (even due to gratuitous naming differences).
> > > > > >
> > > > > > As for why xsimd is not used very much across the codebase, the initial intent was to have more SIMD-accelerated code, but nobody actually got around to doing it, due to other priorities. In the end, we spend more time adding features than hand-optimizing the existing code.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > > > With that in mind, I'd like to propose the following strategy:
> > > > > > > 1. Write a single source file with simple, element-at-a-time for loop implementations of each function.
> > > > > > > 2. Compile this same source file several times with different compile flags for different vectorization (e.g. if we're on an x86 machine that supports AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
> > > > > > > 3. Functions compiled with different instruction sets can be differentiated by a namespace, which gets defined during the compiler invocation. For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2, and then for something like elementwise addition of two arrays, we'd call arrow::AVX2::VectorAdd.
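A sketch of what such a per-instruction-set source file could look like under this three-step proposal, borrowing the `simd_` prefix idea from the naming discussion at the top of the thread; the file and function names are illustrative, not existing Arrow code:

    // simd_scalar_arithmetic.cc (illustrative name). The build would compile
    // this one file once per target instruction set, e.g. with
    // "-mavx2 -DNAMESPACE=AVX2" and again with "-mavx512vl -DNAMESPACE=AVX512",
    // yielding arrow::AVX2::VectorAdd, arrow::AVX512::VectorAdd, and so on
    // from the same plain loop.
    #include <cstdint>

    namespace arrow {
    namespace NAMESPACE {

    void VectorAdd(const int32_t* left, const int32_t* right, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = left[i] + right[i];
      }
    }

    }  // namespace NAMESPACE
    }  // namespace arrow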
> > > > > > > I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels at the cost of some extra CMake magic. After that, it would just be a matter of making the function registry point to these new functions.
> > > > > > >
> > > > > > > Please let me know your thoughts!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sasha Krassovsky
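Finally, a sketch of the dispatch side that `scalar_arithmetic.cc` (or the existing function registry) could own, following the "use the fastest supported instruction set, otherwise fall back" rule discussed in the thread. The per-ISA declarations mirror the file sketched earlier, and the CPU-feature checks are placeholders rather than actual Arrow APIs:

    // scalar_arithmetic.cc (illustrative). Declarations of the per-ISA symbols
    // produced by compiling the simd_ source file with different flags.
    #include <cstdint>

    namespace arrow {

    namespace AVX512 { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
    namespace AVX2   { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
    namespace SSE42  { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }

    using VectorAddFn = void (*)(const int32_t*, const int32_t*, int32_t*, int64_t);

    // Placeholder runtime CPU-feature checks (e.g. cpuid-based, or whatever
    // facility the project settles on); declared here only so the sketch is
    // self-contained.
    bool CpuSupportsAvx512();
    bool CpuSupportsAvx2();

    // Resolved once at startup and registered as the "add" kernel implementation.
    VectorAddFn ResolveVectorAdd() {
      if (CpuSupportsAvx512()) return AVX512::VectorAdd;
      if (CpuSupportsAvx2()) return AVX2::VectorAdd;
      return SSE42::VectorAdd;  // SSE4.2 is the x86 baseline mentioned above
    }

    }  // namespace arrow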