> Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not...

I think a big part of this is how fragile the current kernel system implementation is. It seems to rely on templating lots of different parts of kernels and hoping the compiler will glue them together via inlining and then vectorize. GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able to reproduce the vectorization you noticed with Clang (I'm actually really impressed). But still, as you said, it's not ideal to have one compiler be much faster than another.

The difference is possibly due to the ordering of optimization passes. If a compiler does all of its inlining before vectorization, then we can expect good output. If vectorization happens between two inlining phases, then the loop does not get vectorized, even if the relevant code is inlined later. My personal rule of thumb to handle this is to write code that would be optimized well within only a few optimization passes (e.g. rely on one inlining pass, one constant propagation pass, and one vectorization pass) instead of relying on the correct ordering of a large number of passes.

> How do we ensure that common compiler optimize this routine as expected?

I think the main thing we can do is to write simple code that's obvious for the compiler to vectorize ("meet the compiler halfway" in some sense). A simple for loop operating on pointers is vectorized very well even by ancient compilers (below I tested clang 3.9), so I think for a lot of these kernels we'd be much more confident of them being consistently vectorized.
https://godbolt.org/z/e36M948jv

> therefore using xsimd does not prevent the compiler from optimizing more.

My main hesitation here is the same as above with the kernels system: the more inlining that has to happen, the less likely it is for the compiler to optimize as efficiently. However, I do agree this is probably not a big issue with xsimd as it seems pretty lightweight.

> we are very responsive to any request and we can quickly implement features

This is great to hear, and definitely makes me feel a lot better about keeping and using xsimd where necessary.

> Note however that it requires some basic checking of the output across different compilers and different compiler versions.

Definitely agree here too. I would be in favor of some middle ground of using autovectorization for simple stuff (like the Add kernel above), and using xsimd for stuff that doesn't get vectorized well by some compilers.

So it seems to me that before my email, the status was:
- We were going to simd-optimize our kernels using xsimd, but hadn't gotten to it yet. This would've involved changing the kernels system to use xsimd inside of the operation functors?
- We were not going to use xsimd's dynamic dispatch, and instead roll our own.

If that's the case then my proposal is relatively minor. Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different compile flags, so my proposal doesn't change that. The part that my proposal changes is to refactor the code to be more autovectorization-friendly.
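
A minimal sketch of what such an autovectorization-friendly kernel could look like is below. The name `AddArrays` and its signature are made up for illustration and are not taken from the Arrow codebase; the point is only that the whole loop lives in one function, so the compiler does not have to inline anything before it can vectorize.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel (not Arrow's actual API): the entire loop is visible
// in a single function, operating directly on raw pointers.
// Note: for signed integer types, overflow in `+` is undefined behavior in
// C++; a real kernel would have to decide how to handle that.
template <typename T>
void AddArrays(const T* left, const T* right, T* out, std::size_t length) {
  for (std::size_t i = 0; i < length; ++i) {
    out[i] = left[i] + right[i];
  }
}

// Explicit instantiations for the integer widths discussed in this thread,
// so that each one can be inspected in the disassembly.
template void AddArrays<int8_t>(const int8_t*, const int8_t*, int8_t*, std::size_t);
template void AddArrays<int16_t>(const int16_t*, const int16_t*, int16_t*, std::size_t);
template void AddArrays<int32_t>(const int32_t*, const int32_t*, int32_t*, std::size_t);
template void AddArrays<int64_t>(const int64_t*, const int64_t*, int64_t*, std::size_t);
```

Whether a given compiler actually vectorizes each instantiation still has to be checked in the generated assembly, per compiler and per version.
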
This seems to be a minor difference from what was planned before: instead of having the operation functor process one simd-batch at a time, we have it process the whole batch (to minimize the reliance on inlining). And of course we'd have to actually implement the dynamic dispatch (I don't think I see it for most of the kernels system).

Does something like this sound reasonable to people?

Sasha

On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille <johan.mabi...@gmail.com> wrote:

> Hi all,
>
> xsimd core developer here writing on behalf of the xsimd core team ;)
>
> I just wanted to add some elements to this thread:
>
> - xsimd is more than a library that wraps simple intrinsics. It provides
>   vectorized (and accurate) implementations of the traditional
>   mathematical functions (exp, sin, cos, etc.) that a compiler may not be
>   able to vectorize. And when the compiler does vectorize them, it usually
>   relies on external libraries that provide such implementations.
>   Therefore, you cannot guarantee that every compiler will optimize these
>   calls, nor the same accuracy across different platforms.
>
> - compilers actually understand intrinsics and can optimize them; it's not
>   a "one intrinsic -> one asm instruction" mapping at all. Therefore, using
>   xsimd does not prevent the compiler from optimizing more.
>
> - xsimd is actively developed and maintained. It is used as a building
>   block of the xtensor stack, and in Pythran which has been integrated in
>   scipy. Some features may be missing because of a lack of time and/or
>   higher priorities. However, Antoine can confirm that we are very
>   responsive to any request and we can quickly implement features that
>   would be mandatory for Apache Arrow.
>
> - It is 100% true that for simple loops with simple arithmetic operations,
>   it is easier not to write xsimd code and let the compiler optimize the
>   loop.
>   Note however that it requires some basic checking of the output
>   across different compilers and different compiler versions. See for
>   instance https://godbolt.org/z/KTcTe1zPn. Different versions of gcc
>   generate different vectorized code, and clang and gcc do not
>   auto-vectorize at the same optimization level (O2 for clang and O3 or
>   O2 -ftree-vectorize for gcc).
>
> Regards,
>
> Johan
>
> On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > Hi Sasha,
> >
> > On 30/03/2022 at 00:14, Sasha Krassovsky wrote:
> > > I've noticed that we include xsimd as an abstraction over all of the
> > > simd architectures. I'd like to propose a different solution which
> > > would result in fewer lines of code, while being more readable.
> > >
> > > My thinking is that anything simple enough to abstract with xsimd can
> > > be autovectorized by the compiler. Any more interesting SIMD algorithm
> > > usually is tailored to the target instruction set and can't be
> > > abstracted away with xsimd anyway.
> >
> > As a matter of fact, we already rely on auto-vectorization in a couple
> > of places (mostly `aggregate_basic_avx2.cc` and friends).
> >
> > The main concern with doing this is that auto-vectorization makes
> > performance difficult to predict and ensure. Some compilers or compiler
> > versions may fail to vectorize a given piece of code (how does MSVC fare
> > these days?). As a consequence, some Arrow builds may be 2x to 8x faster
> > than others on the same machine and the same workload. This is not an
> > optimal user experience, and in many cases it can be difficult or
> > impossible to change the compiler version.
> >
> >
> > As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we
> > already get some auto-vectorization and we can look at the concrete
> > numbers.
> > On my AMD Zen 2 CPU I get the following numbers on the pairwise
> > addition kernel (this is with clang 12.0):
> > https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5
> >
> > Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput
> > on that CPU (according to
> > https://www.agner.org/optimize/instruction_tables.pdf), all these data
> > types should show the same performance in bytes/second. Yet int64 gets
> > half of int16 and int32, and int8 is far behind.
> >
> > Looking at the disassembly, the int16 and int32 versions are unrolled
> > and vectorized by clang 12.0, the int8 and int64 are not... Why? How do
> > we ensure that common compilers optimize this routine as expected?
> > Personally, I have no idea.
> >
> >
> > As for why xsimd rather than hand-written intrinsics, it's a middle
> > ground in terms of maintainability and readability. Code using
> > intrinsics quickly gets hard to follow except for the daily SIMD
> > specialist, and it's also annoying to port to another architecture (if
> > only because of gratuitous naming differences).
> >
> > As for why xsimd is not used very much across the codebase, the initial
> > intent was to have more SIMD-accelerated code, but nobody actually got
> > around to doing it, due to other priorities. In the end, we spend more
> > time adding features than hand-optimizing the existing code.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > With that in mind, I'd like to propose the following strategy:
> > > 1. Write a single source file with simple, element-at-a-time for loop
> > >    implementations of each function.
> > > 2. Compile this same source file several times with different compile
> > >    flags for different vectorization (e.g. if we're on an x86 machine
> > >    that supports AVX2 and AVX512, we'd compile once with -mavx2 and
> > >    once with -mavx512vl).
> > > 3. Functions compiled with different instruction sets can be
> > >    differentiated by a namespace, which gets defined during the
> > >    compiler invocation. For example, for AVX2 we'd invoke the compiler
> > >    with -DNAMESPACE=AVX2 and then for something like elementwise
> > >    addition of two arrays, we'd call arrow::AVX2::VectorAdd.
> > >
> > > I believe this would let us remove xsimd as a dependency while also
> > > giving us lots of vectorized kernels at the cost of some extra cmake
> > > magic. After that, it would just be a matter of making the function
> > > registry point to these new functions.
> > >
> > > Please let me know your thoughts!
> > >
> > > Thanks,
> > > Sasha Krassovsky
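
The three-step strategy quoted above could be sketched roughly as follows. The flags `-mavx2`, `-mavx512vl`, `-DNAMESPACE=AVX2` and the name `arrow::AVX2::VectorAdd` come from the proposal; the file name `vector_add.cc`, the `Baseline` fallback, the `AVX512` namespace name, the function signature and the exact compiler command lines are assumptions made only for illustration.

```cpp
// vector_add.cc (hypothetical file name).
// Meant to be compiled once per target instruction set, for example:
//   c++ -O3 -mavx2     -DNAMESPACE=AVX2   -c vector_add.cc -o vector_add_avx2.o
//   c++ -O3 -mavx512vl -DNAMESPACE=AVX512 -c vector_add.cc -o vector_add_avx512.o
// The command lines above are illustrative, not Arrow's actual build setup.
#include <cstddef>
#include <cstdint>

#ifndef NAMESPACE
// Assumed fallback so the file also builds without -DNAMESPACE=...
#define NAMESPACE Baseline
#endif

namespace arrow {
namespace NAMESPACE {  // expands to AVX2, AVX512, Baseline, ...

// Step 1: a simple element-at-a-time loop. Vectorizing it is left to the
// compiler, using whatever instruction set the compile flags allow.
void VectorAdd(const int32_t* left, const int32_t* right, int32_t* out,
               std::size_t length) {
  for (std::size_t i = 0; i < length; ++i) {
    out[i] = left[i] + right[i];
  }
}

}  // namespace NAMESPACE
}  // namespace arrow
```

Because each object file defines its function in a different namespace, the objects can be linked into one binary without symbol clashes. Choosing between `arrow::AVX2::VectorAdd` and the other variants at runtime (the dynamic dispatch step) is not shown here.
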