>
> I agree that the compiler should generate good vectorization for the
> non-null data part, but in fact it didn't. jedbrown pointed out that
> you can force the compiler to SIMD using some additional pragmas,
> something like "#pragma omp simd reduction(+:sum)"

It is an interesting question why.  We won't always be able to rely on the
compiler, but if it does something unexpected, I'm not sure the best
response is to jump to intrinsics.

In this case I think most of the gain could be achieved by adjusting
"constexpr int64_t kRoundFactor = 8;" [1]

to "constexpr int64_t kRoundFactor = SIMD_REGISTER_SIZE / sizeof(Type);"

[1]
https://github.com/apache/arrow/blob/efb707a5438380dcef78418668b57c3f60592a23/cpp/src/arrow/compute/kernels/aggregate_basic.cc#L143

On Tue, Jun 9, 2020 at 11:04 PM Du, Frank <frank...@intel.com> wrote:

> The PR I committed provides basic support for runtime dispatching. I
> agree that the compiler should generate good vectorization for the
> non-null data part, but in fact it didn't. jedbrown pointed out that
> you can force the compiler to SIMD using some additional pragmas,
> something like "#pragma omp simd reduction(+:sum)". I will try this
> pragma later but need to figure out whether it requires linking
> against OpenMP. As I said in the PR, the next step is to provide
> acceleration for the nullable data part, which is more typical in the
> real world and hard for the compiler to vectorize. The nullable path
> with manual intrinsics is very easy for AVX512 thanks to the native
> mask support [1]. I made some initial attempts on the SSE path locally
> and concluded that not much gain can be achieved, but I would expect
> it to be totally different for AVX2, given the extra computation
> bandwidth AVX2 provides. Considering that most recent x86 hardware
> already supports AVX2, I can remove the SSE intrinsic path to reduce
> the maintenance burden.
>
> For the SIMD wrapper, it seems popular compute libraries (NumPy,
> OpenBLAS, etc.) also use intrinsics directly. I heard NumPy is trying
> to unify a single interface but is still struggling for many reasons;
> the hardware vendors provide similar interfaces, but there are still
> too many differences in the details.
>
> [1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Wednesday, June 10, 2020 12:38 PM
> To: dev <dev@arrow.apache.org>
> Subject: Re: [C++][Discuss] Approaches for SIMD optimizations
>
> A few thoughts on this as a high level:
> 1.  Most of the libraries don't support runtime dispatch (libsimdpp seems
> to be the exception here), so we should decide if we want to roll our own
> dynamic dispatch mechanism.
> 2.  It isn't clear to me from the linked PR what the performance delta
> is between the hand-written SIMD code and what the compiler would
> generate.  For simple aggregates of non-null data I would expect
> pretty good auto-vectorization.
> Compiler auto-vectorization seems to get better over time.  For instance
> the scalar example linked in the paper seems to get vectorized somewhat
> under Clang 10 (https://godbolt.org/z/oPopQL).
> 3.  It appears there are some efforts to make a standardized C++ library
> [1] which might be based on Vc.
>
> My initial thought is that, in the short term, we should focus on the
> dynamic dispatch question (continue to build our own vs. adopt an
> existing library) and lean on the compiler for most vectorization.
> Using intrinsics should be limited to complex numerical functions and
> places where the compiler fails to vectorize/translate well (e.g. bit
> manipulations).
>
> If we do find the need for a dedicated library, I would lean towards
> something that will converge to a standard, to reduce additional
> dependencies in the long run. That being said, most of these libraries
> seem to be header-only, so the dependency is fairly lightweight and we
> can vendor them if need be.
>
> [1] https://en.cppreference.com/w/cpp/experimental/simd
>
>
>
>
>
> On Tue, Jun 9, 2020 at 3:32 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Thank you.  xsimd used to require C++14, but apparently they have
> > demoted it to C++11.  Good!
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 09/06/2020 à 12:04, Maarten Breddels a écrit :
> > > Hi Antoine,
> > >
> > > Adding xsimd to the list of options:
> > >  * https://github.com/xtensor-stack/xsimd
> > > Not sure how it compares to the rest though.
> > >
> > > cheers,
> > >
> > > Maarten
> > >
> >
>
