The PR I submitted provides basic support for runtime dispatching. I agree the compiler should generate well-vectorized code for the non-null data path, but in practice it didn't. jedbrown pointed out that the compiler can be forced to use SIMD with additional pragmas, something like "#pragma omp simd reduction(+:sum)". I will try this pragma later, but I need to figure out whether it requires linking against OpenMP. As I said in the PR, the next step is to accelerate the nullable data path, which is more typical in the real world and hard for the compiler to vectorize. The nullable path is very easy to write with manual intrinsics on AVX-512 thanks to its native mask support [1]. I made some initial attempts on an SSE path locally and concluded that not much gain can be achieved there, but I would expect AVX2 to be quite different, given the extra computation bandwidth it provides. Since most recent x86 hardware already supports AVX2, I can drop the SSE intrinsic path to reduce the maintenance burden.
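To illustrate the pragma jedbrown suggested, here is a minimal sketch of a non-null sum kernel (the function name and setup are mine, not from the PR). One relevant detail: GCC and Clang accept -fopenmp-simd, which honors "#pragma omp simd" without linking the OpenMP runtime; without any flag the pragma is simply ignored and the loop still computes the same result.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical non-null sum kernel. The pragma asks the compiler to
// vectorize the loop and combine the per-lane partial sums into `sum`.
// Build with -fopenmp-simd (GCC/Clang) to enable the pragma without
// pulling in the OpenMP runtime library.
int64_t SumNonNull(const std::vector<int64_t>& values) {
  int64_t sum = 0;
#pragma omp simd reduction(+ : sum)
  for (std::size_t i = 0; i < values.size(); ++i) {
    sum += values[i];
  }
  return sum;
}
```

Whether the pragma helps beyond what auto-vectorization already does would need to be confirmed on the actual benchmark.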
For the SIMD wrapper, it seems popular compute libraries (NumPy, OpenBLAS, etc.) also use intrinsics directly. I heard NumPy is trying to unify behind a single interface but is still struggling for many reasons: the hardware vendors provide similar interfaces, but there are too many differences in the details.

[1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers

Thanks,
Frank

-----Original Message-----
From: Micah Kornfield <emkornfi...@gmail.com>
Sent: Wednesday, June 10, 2020 12:38 PM
To: dev <dev@arrow.apache.org>
Subject: Re: [C++][Discuss] Approaches for SIMD optimizations

A few thoughts on this at a high level:

1. Most of the libraries don't support runtime dispatch (libsimdpp seems to be the exception here), so we should decide if we want to roll our own dynamic dispatch mechanism.

2. It isn't clear to me from the linked PR what the performance delta is between the hand-written SIMD code and what the compiler would generate. For simple aggregates of non-null data I would expect pretty good auto-vectorization. Compiler auto-vectorization seems to get better over time; for instance, the scalar example linked in the paper gets vectorized somewhat under Clang 10 (https://godbolt.org/z/oPopQL).

3. It appears there are some efforts to make a standardized C++ SIMD library [1], which might be based on Vc.

My initial thought is that in the short term we should focus on the dynamic dispatch question (continue to build our own vs. adopt an existing library) and lean on the compiler for most vectorization. Using intrinsics should be limited to complex numerical functions and places where the compiler fails to vectorize/translate well (e.g. bit manipulations). If we do find the need for a dedicated library, I would lean towards something that will converge to a standard, to reduce additional dependencies in the long run. That being said, most of these libraries seem to be header-only, so the dependency is fairly lightweight and we can vendor them if need be.
[1] https://en.cppreference.com/w/cpp/experimental/simd

On Tue, Jun 9, 2020 at 3:32 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Thank you. xsimd used to require C++14, but apparently they have
> demoted it to C++11. Good!
>
> Regards
>
> Antoine.
>
>
> On 09/06/2020 at 12:04, Maarten Breddels wrote:
> > Hi Antoine,
> >
> > Adding xsimd to the list of options:
> > * https://github.com/xtensor-stack/xsimd
> > Not sure how it compares to the rest though.
> >
> > cheers,
> >
> > Maarten
> >
>
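As a footnote to Micah's first point above (rolling our own dynamic dispatch), a hand-rolled mechanism can be quite small. The sketch below is illustrative only: the kernel names and resolver are mine, not Arrow's API, and the "AVX2" variant just falls back to the scalar loop so the example compiles and runs everywhere. __builtin_cpu_supports is a GCC/Clang builtin available on x86.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical baseline kernel: plain scalar sum.
int64_t SumScalar(const int64_t* data, std::size_t n) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += data[i];
  return sum;
}

// Stand-in for a hand-written AVX2 kernel; it delegates to the scalar
// loop here so the sketch stays portable and correct on any machine.
int64_t SumAvx2(const int64_t* data, std::size_t n) {
  return SumScalar(data, n);
}

using SumFn = int64_t (*)(const int64_t*, std::size_t);

// Probe the CPU once and pick the best implementation it supports.
SumFn ResolveSum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

// Resolved once at startup; later calls pay only one indirect call.
const SumFn kSum = ResolveSum();
```

A real implementation would also want a way to override the choice (e.g. an environment variable) for testing the scalar path on AVX2 hardware.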