Yep, a suffix _simd would make perfect sense too.

Sasha 

> On 30 March 2022, at 22:58, Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
> 
>> 
>> 
>> As for a naming convention, we could use something like the prefix `simd_`?
> 
> Bikeshedding moment: we use suffixes today for instruction sets; would it
> make sense to continue that for consistency?
> 
> `scalar_arithmetic_simd.cc`?
> 
>> On Wed, Mar 30, 2022 at 4:58 PM Sasha Krassovsky <krassovskysa...@gmail.com>
>> wrote:
>> 
>> Yep, that's basically what I'm suggesting. If someone contributes an xsimd
>> kernel that's faster than the autovectorized kernel, then it'll be seamless
>> to switch.
>> The xsimd and autovectorized kernels would share the same source file, so
>> anyone contributing an xsimd kernel would just have to change that one
>> function.
>> 
>> As for a naming convention, we could use something like the prefix `simd_`?
>> So for example scalar_arithmetic.cc would have a corresponding file
>> `simd_scalar_arithmetic.cc`,
>> which would be compiled for each simd instruction set that we're targeting.
>> `scalar_arithmetic.cc` would then be responsible for generating the
>> dispatch rules.
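>> 
>> To make that concrete, a minimal sketch of what the dispatch side in
>> scalar_arithmetic.cc could look like (the function names and CPU-feature
>> flags are hypothetical; the per-ISA definitions would come from the simd_
>> file built once per instruction set):
>> 
>>   #include <cstdint>
>> 
>>   namespace arrow {
>>   // Declarations of the per-ISA builds of simd_scalar_arithmetic.cc.
>>   namespace SSE42  { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
>>   namespace AVX2   { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
>>   namespace AVX512 { void VectorAdd(const int32_t*, const int32_t*, int32_t*, int64_t); }
>> 
>>   using VectorAddFn = void (*)(const int32_t*, const int32_t*, int32_t*, int64_t);
>> 
>>   // Resolved once at startup from runtime CPU detection; the has_* flags
>>   // stand in for whatever CPU-info facility ends up being used.
>>   VectorAddFn ResolveVectorAdd(bool has_avx512, bool has_avx2) {
>>     if (has_avx512) return AVX512::VectorAdd;
>>     if (has_avx2) return AVX2::VectorAdd;
>>     return SSE42::VectorAdd;
>>   }
>>   }  // namespace arrow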
>> 
>> Sasha
>> 
>> 
>>> On Wed, Mar 30, 2022 at 4:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>>> 
>>> Apologies if this is an oversimplified opinion, but does it have to be one
>>> or the other?
>>> 
>>> If people contribute and maintain XSIMD kernels then great.
>>> 
>>> If people contribute and maintain auto-vectorizable kernels then great.
>>> 
>>> Then it just comes down to having consistent dispatch rules.  Something
>>> like "use XSIMD kernels if it supports the arch, otherwise fall back to
>>> whatever else."
>>> 
>>> The plan to generate different auto-vectorized versions sounds reasonable.
>>> Maybe cmake can even add the flags by convention based on the filename.  If
>>> it causes downstream cmake issues then we just add a flag to disable that
>>> feature.
>>> 
>>> 
>>> On Wed, Mar 30, 2022, 11:18 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> 
>>>>> 
>>>>> We were not going to use xsimd's dynamic dispatch, and instead roll our own
>>>> 
>>>> There is already a dynamic dispatch facility; see arrow/util/dispatch.h
>>>> 
>>>>> Since we're rolling our own dynamic dispatch, we'd still have to compile
>>>>> the same source file several times with different flags, so my proposal
>>>>> doesn't change that.
>>>> 
>>>> If we are going to be compiling the same file over and over again, it would
>>>> be nice to have a naming convention for such files so they can be easily
>>>> distinguished.  We'd need to tackle this complexity at some point, but
>>>> trying to keep the mechanism understandable by people outside the project
>>>> is something that we should evaluate as this is implemented.
>>>> 
>>>> -Micah
>>>> 
>>>> 
>>>> 
>>>> On Wed, Mar 30, 2022 at 1:19 PM Sasha Krassovsky <
>>>> krassovskysa...@gmail.com>
>>>> wrote:
>>>> 
>>>>>> Looking at the disassembly, the int16 and int32 versions are unrolled and
>>>>>> vectorized by clang 12.0, the int8 and int64 are not...
>>>>> I think a big part of this is how fragile the current kernel system
>>>>> implementation is. It seems to rely on templating lots of different parts
>>>>> of kernels and hoping the compiler will glue them together via inlining,
>>>>> and then vectorize.
>>>>> GCC 10 does not unroll or vectorize even int32 or int16 for me. I am able
>>>>> to reproduce the vectorization you noticed with Clang (I'm actually really
>>>>> impressed).
>>>>> But still, as you said it's not ideal to have one compiler be much faster
>>>>> than another. The reason for the difference is possibly due to the ordering
>>>>> of optimization passes. If a compiler does all of its inlining before
>>>>> vectorization, then we can expect good output. If vectorization happens
>>>>> between two inlining phases, then the compiler cannot vectorize (even if
>>>>> something gets inlined later). My personal rule of thumb to handle this is
>>>>> to write code that would be optimized well within only a few optimization
>>>>> passes (e.g. rely on an inlining, a constant propagation, and a
>>>>> vectorization pass) instead of relying on the correct distribution of a
>>>>> large number of passes.
>>>>> 
>>>>>> How do we ensure that common compilers optimize this routine as expected?
>>>>> I think the main thing we can do is to write simple code that's obvious for
>>>>> the compiler to vectorize ("meet the compiler half way" in some sense).
>>>>> A simple for loop operating on pointers is vectorized very well even by
>>>>> ancient compilers (below I tested clang 3.9), so I think for a lot of these
>>>>> kernels we'd be much more confident of them being consistently vectorized.
>>>>> https://godbolt.org/z/e36M948jv
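>>>>> 
>>>>> For concreteness, the shape of kernel I mean is just a plain pointer loop
>>>>> along these lines (a sketch, not necessarily the exact code behind the
>>>>> godbolt link):
>>>>> 
>>>>>   #include <cstdint>
>>>>> 
>>>>>   // A plain pointer loop: no templates or functors that must be inlined
>>>>>   // before the vectorizer can see the whole loop body.
>>>>>   void AddInt32(const int32_t* a, const int32_t* b, int32_t* out,
>>>>>                 int64_t n) {
>>>>>     for (int64_t i = 0; i < n; ++i) {
>>>>>       out[i] = a[i] + b[i];
>>>>>     }
>>>>>   }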
>>>>> 
>>>>>> therefore using xsimd does not prevent the compiler from optimizing more.
>>>>> My main hesitation here is the same as above with the kernels system: the
>>>>> more inlining that has to happen, the less likely it is for the compiler to
>>>>> optimize as efficiently. However, I do agree this is probably not a big
>>>>> issue with xsimd as it seems pretty lightweight.
>>>>> 
>>>>>> we are very responsive to any request and we can quickly implement
>>>>>> features
>>>>> This is great to hear, and definitely makes me feel a lot better about
>>>>> keeping and using xsimd where necessary.
>>>>> 
>>>>>> Note however that it requires some basic checking of the output across
>>>>>> different compilers and different compiler versions.
>>>>> Definitely agree here too. I would be in favor of some middle ground of
>>>>> using autovectorization for simple stuff (like the Add kernel above), and
>>>>> using xsimd for stuff that doesn't get vectorized well by some compilers.
>>>>> 
>>>>> So it seems to me that before my email, the current status was:
>>>>> - We were going to simd-optimize our kernels using xsimd, but hadn't gotten
>>>>> to it yet. This would've involved changing the kernels system to use xsimd
>>>>> inside of the operation functors?
>>>>> - We were not going to use xsimd's dynamic dispatch, and instead roll our own
>>>>> 
>>>>> If that's the case then my proposal is relatively minor. Since we're
>>>>> rolling our own dynamic dispatch, we'd still have to compile the same
>>>>> source file several times with different flags, so my proposal doesn't
>>>>> change that.
>>>>> The part that my proposal changes is to refactor the code to be more
>>>>> autovectorization-friendly. This seems to be a minor difference from what
>>>>> was planned before, where instead of having the operation functor process
>>>>> one simd-batch at a time, we have it process the whole batch (to minimize
>>>>> the reliance on inlining). And of course we'd have to actually implement
>>>>> the dynamic dispatch (I don't think I see it for most of the kernels
>>>>> system).
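>>>>> In code terms, the shift would be roughly from something like this (a rough
>>>>> sketch; the functor and method names are made up for illustration):
>>>>> 
>>>>>   // today (roughly): the functor handles one element / one simd batch and
>>>>>   // the framework supplies the loop, so vectorization depends on inlining
>>>>>   struct Add {
>>>>>     template <typename T>
>>>>>     static T Call(T left, T right) { return left + right; }
>>>>>   };
>>>>> 
>>>>> to something like this, where the functor owns the whole loop and the
>>>>> vectorizer sees it directly:
>>>>> 
>>>>>   #include <cstdint>  // for int64_t
>>>>> 
>>>>>   // proposed: the functor processes the whole batch in one call, so no
>>>>>   // cross-framework inlining is needed before vectorization
>>>>>   struct Add {
>>>>>     template <typename T>
>>>>>     static void Exec(const T* left, const T* right, T* out, int64_t n) {
>>>>>       for (int64_t i = 0; i < n; ++i) out[i] = left[i] + right[i];
>>>>>     }
>>>>>   };
>>>>> 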
>>>>> Does something like this sound reasonable to people?
>>>>> 
>>>>> Sasha
>>>>> 
>>>>> On Wed, Mar 30, 2022 at 1:47 AM Johan Mabille <johan.mabi...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> xsimd core developer here writing on behalf of the xsimd core team ;)
>>>>>> 
>>>>>> I just wanted to add some elements to this thread:
>>>>>> 
>>>>>> - xsimd is more than a library that wraps simple intrinsics. It provides
>>>>>> vectorized (and accurate) implementations of the traditional mathematical
>>>>>> functions (exp, sin, cos, etc.) that a compiler may not be able to
>>>>>> vectorize. And when the compiler does vectorize them, it usually relies on
>>>>>> external libraries that provide such implementations. Therefore, you cannot
>>>>>> guarantee that every compiler will optimize these calls, nor the same
>>>>>> accuracy across different platforms.
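>>>>>> 
>>>>>> (For readers less familiar with xsimd, such a call looks roughly like the
>>>>>> sketch below; this assumes the xsimd 8-style batch API and is only meant
>>>>>> to illustrate the shape of the code:)
>>>>>> 
>>>>>>   #include <cmath>
>>>>>>   #include <cstddef>
>>>>>>   #include <xsimd/xsimd.hpp>
>>>>>> 
>>>>>>   // Vectorized exp() over a buffer, one register-wide batch per step,
>>>>>>   // with a scalar tail for the remainder.
>>>>>>   void ExpFloat(const float* in, float* out, std::size_t n) {
>>>>>>     using batch = xsimd::batch<float>;
>>>>>>     std::size_t i = 0;
>>>>>>     for (; i + batch::size <= n; i += batch::size) {
>>>>>>       batch x = batch::load_unaligned(in + i);
>>>>>>       xsimd::exp(x).store_unaligned(out + i);
>>>>>>     }
>>>>>>     for (; i < n; ++i) out[i] = std::exp(in[i]);
>>>>>>   }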
>>>>>> 
>>>>>> - compilers actually understand intrinsics and can optimize them; it's not
>>>>>> a "one intrinsic -> one asm instruction" mapping at all. Therefore, using
>>>>>> xsimd does not prevent the compiler from optimizing more.
>>>>>> 
>>>>>> - xsimd is actively developed and maintained. It is used as a building
>>>>>> block of the xtensor stack, and in Pythran which has been integrated in
>>>>>> scipy. Some features may be missing because of a lack of time and/or
>>>>>> higher priorities. However, Antoine can confirm that we are very responsive
>>>>>> to any request and we can quickly implement features that would be
>>>>>> mandatory for Apache Arrow.
>>>>>> 
>>>>>> - It is 100% true that for simple loops with simple arithmetic operations,
>>>>>> it is easier not to write xsimd code and let the compiler optimize the
>>>>>> loop. Note however that it requires some basic checking of the output
>>>>>> across different compilers and different compiler versions. See for
>>>>>> instance https://godbolt.org/z/KTcTe1zPn. Different versions of gcc
>>>>>> generate different vectorized code, and clang and gcc do not auto-vectorize
>>>>>> at the same optimization level (O2 for clang, and O3 or O2 -ftree-vectorize
>>>>>> for gcc).
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Johan
>>>>>> 
>>>>>> On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou <anto...@python.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi Sasha,
>>>>>>> 
>>>>>>> On 30/03/2022 00:14, Sasha Krassovsky wrote:
>>>>>>>> I've noticed that we include xsimd as an abstraction over all of the simd
>>>>>>>> architectures. I'd like to propose a different solution which would
>>>>>>>> result in fewer lines of code, while being more readable.
>>>>>>>> 
>>>>>>>> My thinking is that anything simple enough to abstract with xsimd can be
>>>>>>>> autovectorized by the compiler. Any more interesting SIMD algorithm
>>>>>>>> usually is tailored to the target instruction set and can't be abstracted
>>>>>>>> away with xsimd anyway.
>>>>>>> 
>>>>>>> As a matter of fact, we already rely on auto-vectorization in a couple
>>>>>>> of places (mostly `aggregate_basic_avx2.cc` and friends).
>>>>>>> 
>>>>>>> The main concern with doing this is that auto-vectorization makes
>>>>>>> performance difficult to predict and ensure. Some compilers or compiler
>>>>>>> versions may fail to vectorize a given piece of code (how does MSVC fare
>>>>>>> these days?). As a consequence, some Arrow builds may be 2x to 8x faster
>>>>>>> than others on the same machine and the same workload. This is not an
>>>>>>> optimal user experience, and in many cases it can be difficult or
>>>>>>> impossible to change the compiler version.
>>>>>>> 
>>>>>>> 
>>>>>>> As a matter of fact, on x86 our baseline instruction set is SSE4.2, so we
>>>>>>> already get some auto-vectorization and we can look at the concrete
>>>>>>> numbers. On my AMD Zen 2 CPU I get the following numbers on the pairwise
>>>>>>> addition kernel (this is with clang 12.0):
>>>>>>> https://gist.github.com/pitrou/dbfa3d97afb4ebbe096bab69cb6bb4d5
>>>>>>> 
>>>>>>> Given that PADDB, PADDW, PADDD and PADDQ all have the same throughput
>>>>>>> on that CPU (according to
>>>>>>> https://www.agner.org/optimize/instruction_tables.pdf), all these data
>>>>>>> types should show the same performance in bytes/second.  Yet int64 gets
>>>>>>> half of int16 and int32, and int8 is far behind.
>>>>>>> 
>>>>>>> Looking at the disassembly, the int16 and int32 versions are unrolled and
>>>>>>> vectorized by clang 12.0, the int8 and int64 are not... Why? How do we
>>>>>>> ensure that common compilers optimize this routine as expected?
>>>>>>> Personally, I have no idea.
>>>>>>> 
>>>>>>> 
>>>>>>> As for why xsimd rather than hand-written intrinsics, it's a middle
>>>>>>> ground in terms of maintainability and readability. Code using
>>>>>>> intrinsics quickly gets hard to follow except for the daily SIMD
>>>>>>> specialist, and it's also annoying to port to another architecture (even
>>>>>>> due to gratuitous naming differences).
>>>>>>> 
>>>>>>> As for why xsimd is not very much used across the codebase, the initial
>>>>>>> intent was to have more SIMD-accelerated code, but nobody actually got
>>>>>>> around to doing it, due to other priorities. In the end, we spend more
>>>>>>> time adding features than hand-optimizing the existing code.
>>>>>>> 
>>>>>>> Regards
>>>>>>> 
>>>>>>> Antoine.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> With that in mind, I'd like to propose the following strategy:
>>>>>>>> 1. Write a single source file with simple, element-at-a-time for loop
>>>>>>>> implementations of each function.
>>>>>>>> 2. Compile this same source file several times with different compile
>>>>>>>> flags for different vectorization (e.g. if we're on an x86 machine that
>>>>>>>> supports AVX2 and AVX512, we'd compile once with -mavx2 and once with
>>>>>>>> -mavx512vl).
>>>>>>>> 3. Functions compiled with different instruction sets can be
>>>>>>>> differentiated by a namespace, which gets defined during the compiler
>>>>>>>> invocation. For example, for AVX2 we'd invoke the compiler with
>>>>>>>> -DNAMESPACE=AVX2 and then for something like elementwise addition of two
>>>>>>>> arrays, we'd call arrow::AVX2::VectorAdd.
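>>>>>>>> 
>>>>>>>> As a rough sketch of what step 3 could look like (the file, macro, and
>>>>>>>> function names here are only illustrative):
>>>>>>>> 
>>>>>>>>   // vector_add.cc -- compiled once per target ISA, e.g.:
>>>>>>>>   //   -msse4.2   -DNAMESPACE=SSE42
>>>>>>>>   //   -mavx2     -DNAMESPACE=AVX2
>>>>>>>>   //   -mavx512vl -DNAMESPACE=AVX512
>>>>>>>>   #include <cstdint>
>>>>>>>> 
>>>>>>>>   namespace arrow {
>>>>>>>>   namespace NAMESPACE {
>>>>>>>> 
>>>>>>>>   // Plain loop; each per-ISA copy is auto-vectorized for its target.
>>>>>>>>   void VectorAdd(const int32_t* a, const int32_t* b, int32_t* out,
>>>>>>>>                  int64_t n) {
>>>>>>>>     for (int64_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
>>>>>>>>   }
>>>>>>>> 
>>>>>>>>   }  // namespace NAMESPACE
>>>>>>>>   }  // namespace arrow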
>>>>>>>> 
>>>>>>>> I believe this would let us remove xsimd as a dependency while also
>>>>>>>> giving us lots of vectorized kernels at the cost of some extra cmake
>>>>>>>> magic. After that, it would just be a matter of making the function
>>>>>>>> registry point to these new functions.
>>>>>>>> 
>>>>>>>> Please let me know your thoughts!
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Sasha Krassovsky
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
