I'm revisiting this old thread as I see some AVX512 code was merged recently[1].
Code maintenance will be non-trivial if we want to cover more
hardware (sse/avx/avx512/neon/sve/...) and optimize more code in the
future. Scattering #ifdef blocks everywhere is obviously a no-go.

So I'm selling my proposal again :)
- put all machine-dependent code in one place (similar to how Linux manages
various CPU arches)
- add a runtime dispatcher that selects the best SIMD code path for the
running hardware (a rough sketch below)
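To make the dispatcher idea concrete, here is a minimal sketch. The names
(SumScalar, SumSse4, SumAvx2, ResolveSum) are hypothetical, not Arrow's
actual API, and __builtin_cpu_supports is gcc/clang-only (msvc would need
__cpuid, and Arm would check getauxval(AT_HWCAP)):

  #include <cstdint>

  // Portable scalar fallback.
  static int64_t SumScalar(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; ++i) sum += values[i];
    return sum;
  }

  // Placeholders for the per-ISA kernels; in a real tree each would live
  // in its own file compiled with the matching flags (e.g. -mavx2), so
  // all machine-dependent code stays in one place.
  static int64_t SumSse4(const int64_t* values, int64_t length) {
    return SumScalar(values, length);  // placeholder body
  }
  static int64_t SumAvx2(const int64_t* values, int64_t length) {
    return SumScalar(values, length);  // placeholder body
  }

  using SumFn = int64_t (*)(const int64_t*, int64_t);

  // Pick the best variant for the running CPU, once.
  static SumFn ResolveSum() {
  #if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) return SumAvx2;
    if (__builtin_cpu_supports("sse4.2")) return SumSse4;
  #endif
    return SumScalar;
  }

  // Public entry point; resolution happens at first invocation, similar
  // to a gcc indirect function but without depending on the compiler.
  int64_t Sum(const int64_t* values, int64_t length) {
    static const SumFn best = ResolveSum();  // thread-safe since C++11
    return best(values, length);
  }
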

I can provide a PR for community review first. Thoughts?

[1] https://github.com/apache/arrow/pull/6650

On 2019/12/24 18:17:25, Wes McKinney <w...@gmail.com> wrote:
> If we go the route of AOT-compilation of Gandiva kernels as an
> approach to generate a shared library with many kernels, we might
> indeed look at possibly generating a "fat" binary with runtime
> dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD
> altogether) kernels. This is something we could do during the code
> generation step where we generate the "stubs" to invoke the IR
> kernels.
>
> Given where the project is at in its development trajectory, it seems
> important to come up with some concrete answers to some of these
> questions to reduce developer anxiety that may otherwise prevent
> forward progress in feature development.
>
> On Tue, Dec 24, 2019 at 2:37 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > I would lean against adding another library dependency. My main concerns
> > with adding another library dependency are:
> > 1. Supporting it across all of the build tool-chains (using a GCC-specific
> > option would be my least favorite approach).
> > 2. Distributed binary size (for wheels, at least, people seem to care).
> >
> > I would lean more towards yes if there were some real-world benchmarks
> > showing a substantial performance gain.
> >
> > I don't think it is unreasonable to package our binaries targeting a common
> > instruction set (e.g. AVX 1 or 2). For those that want to make full use of
> > their latest hardware, compiling from source doesn't seem unreasonable,
> > especially given the recent effort to trim dependencies.
> >
> > Cheers,
> > Micah
> >
> > On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > > Hi,
> > >
> > > I would recommend against reinventing the wheel. It would be possible
> > > to reuse an existing C++ SIMD library. There are several of them (Vc,
> > > xsimd, libsimdpp...). Of course, "just use Gandiva" is another possible
> > > answer.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > Le 20/12/2019 à 08:32, Yibo Cai a écrit :
> > > > Hi,
> > > >
> > > > I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
> > > >
> > > > A typical case is the sum kernel[1]. The tight loop below can easily be
> > > > optimized with SIMD.
> > > >
> > > > for (int64_t i = 0; i < length; i++) {
> > > >   local.sum += values[i];
> > > > }
> > > >
> > > > The compiler already does loop vectorization, but it's done at compile
> > > > time without knowledge of the target CPU.
> > > > Binaries compiled with avx-512 cannot run on old CPUs, while binaries
> > > > compiled with only sse4 enabled are suboptimal on new hardware.
> > > >
> > > > I have some proposals and would like to hear comments from the community.
> > > >
> > > > - Based on our experience with the ISA-L[2] project (an optimized storage
> > > > acceleration library for x86 and Arm), a runtime dispatcher is a good
> > > > approach. Basically, it links in code optimized for different CPU
> > > > features (sse4, avx2, neon, ...) and selects the one that best fits the
> > > > target CPU at first invocation. This is similar to gcc indirect
> > > > functions[3], but doesn't depend on the compiler.
> > > >
> > > > - Use gcc FMV[4] to generate multiple binaries for one function. See the
> > > > sample source and compiled code[5].
> > > > Though it looks simple, it has many limitations: it's a gcc-specific
> > > > feature, with no support from clang or msvc, and it only works on x86,
> > > > with no Arm support. I think this approach is a no-go.
> > > >
> > > > - Don't do it.
> > > > Gandiva leverages LLVM JIT for runtime code optimization. Is it
> > > > duplicated effort to do it in the C++ kernels? Will these vectorizable
> > > > computations move to Gandiva in the future?
> > > >
> > > > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> > > > [2] https://github.com/intel/isa-l
> > > > [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> > > > [4] https://lwn.net/Articles/691932/
> > > > [5] https://godbolt.org/z/ajpuq_
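
To make the FMV option from the quoted thread concrete: a minimal sketch
using gcc's target_clones attribute (my illustration, not code from the
thread; gcc-specific and x86-only, per the limitations noted above):

  #include <cstdint>

  // gcc emits one clone of the function per listed target, plus an
  // ifunc resolver that picks the best clone when the binary loads.
  __attribute__((target_clones("avx2", "sse4.2", "default")))
  int64_t Sum(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; ++i) sum += values[i];
    return sum;
  }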
