Yep, a `_simd` suffix would make perfect sense too.

Sasha
> On Mar 30, 2022, at 22:58, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> As for a naming convention, we could use something like the prefix `simd_`?
>
> Bikeshedding moment: we use suffixes today for instruction sets; would it make sense to continue that for consistency?
>
> `scalar_arithmetic_simd.cc`?
>
>> On Wed, Mar 30, 2022 at 4:58 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>>
>> Yep, that's basically what I'm suggesting. If someone contributes an xsimd kernel that's faster than the autovectorized kernel, then it'll be seamless to switch. The xsimd and autovectorized kernels would share the same source file, so anyone contributing an xsimd kernel would just have to change that one function.
>>
>> As for a naming convention, we could use something like the prefix `simd_`? So for example `scalar_arithmetic.cc` would have a corresponding file `simd_scalar_arithmetic.cc`, which would be compiled for each SIMD instruction set that we're targeting. `scalar_arithmetic.cc` would then be responsible for generating the dispatch rules.
>>
>> Sasha
>>
>>> On Wed, Mar 30, 2022 at 4:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>>>
>>> Apologies if this is an oversimplified opinion, but does it have to be one or the other?
>>>
>>> If people contribute and maintain XSIMD kernels, then great.
>>>
>>> If people contribute and maintain auto-vectorizable kernels, then great.
>>>
>>> Then it just comes down to having consistent dispatch rules. Something like "use XSIMD kernels if XSIMD supports the arch, otherwise fall back to whatever else."
>>>
>>> The plan to generate different auto-vectorized versions sounds reasonable. Maybe cmake can even add the flags by convention based on the filename. If it causes downstream cmake issues, then we just add a flag to disable that feature.
>>>
>>>> On Wed, Mar 30, 2022, 11:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>
>>>>> We were not going to use xsimd's dynamic dispatch, and instead roll our own
>>>>
>>>> There is already a dynamic dispatch facility; see arrow/util/dispatch.h.
>>>>
>>>>> Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that.
>>>>
>>>> If we are going to be compiling the same file over and over again, it would be nice to have a naming convention for such files so they can be easily distinguished. We'd need to tackle this complexity at some point, but trying to keep the mechanism understandable to people outside the project is something that we should evaluate as this is implemented.
>>>>
>>>> -Micah
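To make the naming and dispatch split discussed above concrete, here is a minimal sketch of how the non-SIMD translation unit could own the dispatch rule while per-ISA builds of a hypothetical `simd_scalar_arithmetic.cc` provide the variants. The namespaces, function names, and feature check below are illustrative assumptions, not existing Arrow APIs (Arrow's own facility lives in arrow/util/dispatch.h):

    // scalar_arithmetic.cc (sketch) -- owns the dispatch rule.
    // A hypothetical simd_scalar_arithmetic.cc would be compiled once per
    // target ISA, each build defining these functions in its own namespace.
    #include <cstdint>

    namespace sketch {
    namespace sse42 { void AddInt32(const int32_t*, const int32_t*, int32_t*, int64_t); }
    namespace avx2  { void AddInt32(const int32_t*, const int32_t*, int32_t*, int64_t); }

    using AddInt32Fn = void (*)(const int32_t*, const int32_t*, int32_t*, int64_t);

    // Pick the most specific variant the running CPU supports, fall back otherwise.
    AddInt32Fn ResolveAddInt32() {
    #if defined(__GNUC__) || defined(__clang__)
      if (__builtin_cpu_supports("avx2")) return avx2::AddInt32;
    #endif
      return sse42::AddInt32;  // SSE4.2 is the x86 baseline in this sketch
    }
    }  // namespace sketch

Resolving once into a function pointer keeps the per-call cost to an indirect call, and the rule itself ("use the most specific supported variant, otherwise fall back") stays in one place regardless of how many per-ISA copies get built.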
>>>>> On Wed, Mar 30, 2022 at 1:19 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>>>>>
>>>>>> Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not...
>>>>>
>>>>> I think a big part of this is how fragile the current kernel system implementation is. It seems to rely on templating lots of different parts of kernels and hoping the compiler will glue them together via inlining and then vectorize. GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able to reproduce the vectorization you noticed with Clang (I'm actually really impressed).
>>>>>
>>>>> But still, as you said, it's not ideal to have one compiler be much faster than another. The reason for the difference is possibly due to the ordering of optimization passes. If a compiler does all of its inlining before vectorization, then we can expect good output. If vectorization happens between two inlining phases, then the compiler cannot vectorize (even if something gets inlined later). My personal rule of thumb to handle this is to write code that would be optimized well within only a few optimization passes (e.g. rely on an inlining pass, a constant propagation pass, and a vectorization pass) instead of relying on the correct distribution of a large number of passes.
>>>>>
>>>>>> How do we ensure that common compilers optimize this routine as expected?
>>>>>
>>>>> I think the main thing we can do is to write simple code that's obvious for the compiler to vectorize ("meet the compiler half way" in some sense). A simple for loop operating on pointers is vectorized very well even by ancient compilers (below I tested clang 3.9), so for a lot of these kernels we'd be much more confident of them being consistently vectorized: https://godbolt.org/z/e36M948jv
>>>>>
>>>>>> therefore using xsimd does not prevent the compiler from optimizing more.
>>>>>
>>>>> My main hesitation here is the same as above with the kernels system: the more inlining that has to happen, the less likely it is for the compiler to optimize as efficiently. However, I do agree this is probably not a big issue with xsimd, as it seems pretty lightweight.
>>>>>
>>>>>> we are very responsive to any request and we can quickly implement features
>>>>>
>>>>> This is great to hear, and definitely makes me feel a lot better about keeping and using xsimd where necessary.
>>>>>
>>>>>> Note however that it requires some basic checking of the output across different compilers and different compiler versions.
>>>>>
>>>>> Definitely agree here too. I would be in favor of some middle ground of using autovectorization for simple stuff (like the Add kernel above) and using xsimd for stuff that doesn't get vectorized well by some compilers.
>>>>>
>>>>> So it seems to me that before my email, the current status was:
>>>>> - We were going to simd-optimize our kernels using xsimd, but hadn't gotten to it yet. This would've involved changing the kernels system to use xsimd inside of the operation functors?
>>>>> - We were not going to use xsimd's dynamic dispatch, and instead roll our own.
>>>>>
>>>>> If that's the case, then my proposal is relatively minor. Since we're rolling our own dynamic dispatch, we'd still have to compile the same source file several times with different flags, so my proposal doesn't change that. The part that my proposal changes is to refactor the code to be more autovectorization-friendly. This seems to be a minor difference from what was planned before, where instead of having the operation functor process one simd-batch at a time, we have it process the whole batch (to minimize the reliance on inlining). And of course we'd have to actually implement the dynamic dispatch (I don't think I see it for most of the kernels system). Does something like this sound reasonable to people?
>>>>>
>>>>> Sasha
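The "simple for loop operating on pointers" shape described above, with the operation processing the whole batch in one call rather than one SIMD batch at a time, would look roughly like this (names are illustrative, not actual Arrow kernel code):

    #include <cstdint>

    // Element-at-a-time addition over the whole batch; simple enough that
    // compilers autovectorize it reliably (see the godbolt link above).
    void AddInt32(const int32_t* left, const int32_t* right, int32_t* out,
                  int64_t length) {
      for (int64_t i = 0; i < length; ++i) {
        out[i] = left[i] + right[i];
      }
    }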
>>>>>> On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille <johan.mabi...@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> xsimd core developer here, writing on behalf of the xsimd core team ;)
>>>>>>
>>>>>> I just wanted to add some elements to this thread:
>>>>>>
>>>>>> - xsimd is more than a library that wraps simple intrinsics. It provides vectorized (and accurate) implementations of the traditional mathematical functions (exp, sin, cos, etc.) that a compiler may not be able to vectorize. And when the compiler does vectorize them, it usually relies on external libraries that provide such implementations. Therefore, you cannot guarantee that every compiler will optimize these calls, nor the same accuracy across different platforms.
>>>>>>
>>>>>> - compilers actually understand intrinsics and can optimize them; it's not a "one intrinsic -> one asm instruction" mapping at all. Therefore, using xsimd does not prevent the compiler from optimizing more.
>>>>>>
>>>>>> - xsimd is actively developed and maintained. It is used as a building block of the xtensor stack, and in Pythran, which has been integrated in scipy. Some features may be missing because of a lack of time and/or higher priorities. However, Antoine can confirm that we are very responsive to any request and we can quickly implement features that would be mandatory for Apache Arrow.
>>>>>>
>>>>>> - It is 100% true that for simple loops with simple arithmetic operations, it is easier not to write xsimd code and to let the compiler optimize the loop. Note however that it requires some basic checking of the output across different compilers and different compiler versions. See for instance https://godbolt.org/z/KTcTe1zPn: different versions of gcc generate different vectorized code, and clang and gcc do not auto-vectorize at the same optimization level (O2 for clang, O3 or O2 -ftree-vectorize for gcc).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Johan
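To illustrate the first point above (vectorized math functions), here is a rough sketch of an exp() kernel written against an xsimd 8-style API. The function name and the exact API spelling are assumptions to be checked against the xsimd version Arrow pins; this is not code from the Arrow tree:

    #include <xsimd/xsimd.hpp>
    #include <cmath>
    #include <cstddef>

    // Vectorized exp() over a contiguous float buffer; the scalar tail
    // handles the remainder that does not fill a full SIMD batch.
    void ExpInPlace(float* data, std::size_t n) {
      using batch = xsimd::batch<float>;      // default architecture
      constexpr std::size_t lanes = batch::size;
      std::size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        batch v = batch::load_unaligned(data + i);
        xsimd::exp(v).store_unaligned(data + i);
      }
      for (; i < n; ++i) data[i] = std::exp(data[i]);
    }

This is the kind of routine a compiler generally will not autovectorize on its own (or will only vectorize by pulling in an external vector math library), which is where a wrapper library earns its keep.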
>>>>>>> On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>>>>
>>>>>>> Hi Sasha,
>>>>>>>
>>>>>>> On 30/03/2022 at 00:14, Sasha Krassovsky wrote:
>>>>>>>> I've noticed that we include xsimd as an abstraction over all of the simd architectures. I'd like to propose a different solution which would result in fewer lines of code, while being more readable.
>>>>>>>>
>>>>>>>> My thinking is that anything simple enough to abstract with xsimd can be autovectorized by the compiler. Any more interesting SIMD algorithm usually is tailored to the target instruction set and can't be abstracted away with xsimd anyway.
>>>>>>>
>>>>>>> As a matter of fact, we already rely on auto-vectorization in a couple of places (mostly `aggregate_basic_avx2.cc` and friends).
>>>>>>>
>>>>>>> The main concern with doing this is that auto-vectorization makes performance difficult to predict and ensure. Some compilers or compiler versions may fail to vectorize a given piece of code (how does MSVC fare these days?). As a consequence, some Arrow builds may be 2x to 8x faster than others on the same machine and the same workload. This is not an optimal user experience, and in many cases it can be difficult or impossible to change the compiler version.
>>>>>>>
>>>>>>> As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we already get some auto-vectorization and we can look at the concrete numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise addition kernel (this is with clang 12.0): https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5
>>>>>>>
>>>>>>> Judging that PADDB, PADDW, PADDD and PADDQ all have the same throughput on that CPU (according to https://www.agner.org/optimize/instruction_tables.pdf), all these data types should show the same performance in bytes/second. Yet int64 gets half of int16 and int32, and int8 is far behind.
>>>>>>>
>>>>>>> Looking at the disassembly, the int16 and int32 versions are unrolled and vectorized by clang 12.0, the int8 and int64 are not... Why? How do we ensure that common compilers optimize this routine as expected? Personally, I have no idea.
>>>>>>>
>>>>>>> As for why xsimd rather than hand-written intrinsics, it's a middle ground in terms of maintainability and readability. Code using intrinsics quickly gets hard to follow except for the daily SIMD specialist, and it's also annoying to port to another architecture (even due to gratuitous naming differences).
>>>>>>>
>>>>>>> As for why xsimd is not very much used across the codebase, the initial intent was to have more SIMD-accelerated code, but nobody actually got around to doing it, due to other priorities. In the end, we spend more time adding features than hand-optimizing the existing code.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>>> With that in mind, I'd like to propose the following strategy:
>>>>>>>> 1. Write a single source file with simple, element-at-a-time for loop implementations of each function.
>>>>>>>> 2. Compile this same source file several times with different compile flags for different vectorization (e.g. if we're on an x86 machine that supports AVX2 and AVX512, we'd compile once with -mavx2 and once with -mavx512vl).
>>>>>>>> 3. Functions compiled with different instruction sets can be differentiated by a namespace, which gets defined during the compiler invocation. For example, for AVX2 we'd invoke the compiler with -DNAMESPACE=AVX2 and then for something like elementwise addition of two arrays, we'd call arrow::AVX2::VectorAdd.
>>>>>>>>
>>>>>>>> I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels at the cost of some extra cmake magic. After that, it would just be a matter of making the function registry point to these new functions.
>>>>>>>>
>>>>>>>> Please let me know your thoughts!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sasha Krassovsky
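A minimal sketch of what the single-source, multi-compile strategy in the original proposal above could look like; the file name and the VectorAdd signature are illustrative, while the -DNAMESPACE and -m flags follow steps 2 and 3 of the proposal:

    // vector_add.cc (sketch) -- the same file is compiled once per target, e.g.:
    //   c++ -O3 -mavx2     -DNAMESPACE=AVX2   -c vector_add.cc
    //   c++ -O3 -mavx512vl -DNAMESPACE=AVX512 -c vector_add.cc
    // Each object then exposes arrow::AVX2::VectorAdd, arrow::AVX512::VectorAdd, ...
    #include <cstdint>

    namespace arrow {
    namespace NAMESPACE {  // expands per compiler invocation (-DNAMESPACE=...)

    void VectorAdd(const int32_t* a, const int32_t* b, int32_t* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];  // vectorization is left to the compiler per the -m flags
      }
    }

    }  // namespace NAMESPACE
    }  // namespace arrow

CMake would then be responsible for building this file once per enabled instruction set with the matching flags (the "extra cmake magic" mentioned in the proposal), and the function registry would point at whichever namespace the dispatch rules select at runtime.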