The PR I committed provides basic support for runtime dispatching. I agree 
that the compiler should generate well-vectorized code for the non-null data 
path, but in practice it didn't. jedbrown pointed out that the compiler can be 
forced to use SIMD with additional pragmas, something like 
"#pragma omp simd reduction(+:sum)". I will try this pragma later, but I need 
to figure out whether it requires linking against OpenMP. 
As I said in the PR, the next step is to accelerate the nullable data path, 
which is more typical in the real world and hard for the compiler to 
vectorize. The nullable path is very easy to write with manual intrinsics on 
AVX512, thanks to its native support for masks [1]. I made some initial 
attempts on an SSE path locally and concluded that not much gain can be 
achieved, but I would expect AVX2 to be quite different, since AVX2 provides 
more computation bandwidth. Considering that most recent x86 hardware already 
supports AVX2, I can remove the SSE intrinsic path anyway to reduce the 
maintenance burden.

As for a SIMD wrapper, popular compute libraries (NumPy, OpenBLAS, etc.) seem 
to use intrinsics directly as well. I heard NumPy is trying to unify them 
behind a single interface but is still struggling, for many reasons: the 
hardware provides similar interfaces, but there are too many differences in 
the details. 

[1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers

Thanks,
Frank

-----Original Message-----
From: Micah Kornfield <emkornfi...@gmail.com> 
Sent: Wednesday, June 10, 2020 12:38 PM
To: dev <dev@arrow.apache.org>
Subject: Re: [C++][Discuss] Approaches for SIMD optimizations

A few thoughts on this as a high level:
1.  Most of the libraries don't support runtime dispatch (libsimdpp seems to be 
the exception here), so we should decide if we want to roll our own dynamic 
dispatch mechanism.
2.  It isn't clear to me from the linked PR what the performance delta is 
between the hand-written SIMD code and what the compiler would generate.  For 
simple aggregates of non-null data I would expect pretty good 
auto-vectorization. 
Compiler auto-vectorization also seems to get better over time.  For instance, 
the scalar example linked in the paper seems to get vectorized somewhat under 
Clang 10 (https://godbolt.org/z/oPopQL).
3.  It appears there are some efforts to make a standardized C++ library [1] 
which might be based on Vc.
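To illustrate what that standardized interface looks like: libstdc++ (GCC 11 
and later) ships the TS implementation under <experimental/simd>. A minimal 
sketch, with names of my choosing; the vector width is chosen by the target, 
so the same source compiles down to SSE, AVX2, or NEON depending on the -m 
flags.

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Element-wise add across however many lanes the target supports.
float AddBroadcast(float x, float y) {
  stdx::native_simd<float> a = x;  // broadcast x into every lane
  stdx::native_simd<float> b = y;
  auto c = a + b;                  // one vector add per register
  return c[0];                     // every lane holds x + y
}
```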

My initial thought is that in the short term we should focus on the dynamic 
dispatch question (continue to build our own vs. adopt an existing library) 
and lean on the compiler for most vectorization. Using intrinsics should be 
limited to complex numerical functions and places where the compiler fails to 
vectorize/translate well (e.g. bit manipulations).
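For the roll-our-own option, a homegrown dispatch mechanism can be quite 
small. A sketch under my own naming (the AVX2/AVX512 kernels are hypothetical 
placeholders; a real implementation would compile each variant in its own 
translation unit with the matching -mavx2 / -mavx512f flags), using the 
GCC/Clang builtin __builtin_cpu_supports to probe the CPU once at startup:

```cpp
#include <cstddef>
#include <cstdint>

// Baseline scalar kernel; stands in for the specialized variants below.
int64_t SumScalar(const int64_t* values, size_t n) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

using SumFn = int64_t (*)(const int64_t*, size_t);

// Pick the best kernel for the host CPU. The two branches return the
// scalar kernel here only because the SIMD variants are hypothetical;
// they would return SumAvx512 / SumAvx2 in a real build.
SumFn ResolveSum() {
  if (__builtin_cpu_supports("avx512f")) return SumScalar;
  if (__builtin_cpu_supports("avx2")) return SumScalar;
  return SumScalar;
}
```

The resolved pointer would typically be cached in a static so the CPU probe 
happens once per process rather than per call.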

If we do find the need for a dedicated library, I would lean towards something 
that will converge to a standard, to reduce additional dependencies in the 
long run. That said, most of these libraries seem to be header-only, so the 
dependency is fairly lightweight and we can vendor them if need be.

[1] https://en.cppreference.com/w/cpp/experimental/simd





On Tue, Jun 9, 2020 at 3:32 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Thank you.  xsimd used to require C++14, but apparently they have 
> demoted it to C++11.  Good!
>
> Regards
>
> Antoine.
>
>
> On 09/06/2020 at 12:04, Maarten Breddels wrote:
> > Hi Antoine,
> >
> > Adding xsimd to the list of options:
> >  * https://github.com/xtensor-stack/xsimd
> > Not sure how it compares to the rest though.
> >
> > cheers,
> >
> > Maarten
> >
>
