I'm revisiting this old thread as I see some AVX512 code was merged recently[1].
Code maintenance will be non-trivial if we want to cover more
hardware (sse/avx/avx512/neon/sve/...) and optimize more code in the
future. Scattering #ifdef blocks everywhere is obviously a no-go.

So I'm selling my proposal again :)
- put all machine-dependent code in one place (similar to how Linux manages
various CPU arches)
- add a runtime dispatcher that selects the best SIMD code path for the
running hardware (a rough sketch below)
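To make the dispatcher idea concrete, here is a minimal sketch. The names
(SumScalar, SumSse4, SumAvx2, ResolveSum) are hypothetical, not Arrow's
actual API, and __builtin_cpu_supports is gcc/clang-only (msvc would need
__cpuid, and Arm would check getauxval(AT_HWCAP)):

  #include <cstdint>

  // Portable scalar fallback.
  static int64_t SumScalar(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; ++i) sum += values[i];
    return sum;
  }

  // Placeholders for the per-ISA kernels; in a real tree each would live
  // in its own file compiled with the matching flags (e.g. -mavx2), so
  // all machine-dependent code stays in one place.
  static int64_t SumSse4(const int64_t* values, int64_t length) {
    return SumScalar(values, length);  // placeholder body
  }
  static int64_t SumAvx2(const int64_t* values, int64_t length) {
    return SumScalar(values, length);  // placeholder body
  }

  using SumFn = int64_t (*)(const int64_t*, int64_t);

  // Pick the best variant for the running CPU, once.
  static SumFn ResolveSum() {
  #if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) return SumAvx2;
    if (__builtin_cpu_supports("sse4.2")) return SumSse4;
  #endif
    return SumScalar;
  }

  // Public entry point; resolution happens at first invocation, similar
  // to a gcc indirect function but without depending on the compiler.
  int64_t Sum(const int64_t* values, int64_t length) {
    static const SumFn best = ResolveSum();  // thread-safe since C++11
    return best(values, length);
  }
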

I can provide a PR for community review first. Thoughts?

[1] https://github.com/apache/arrow/pull/6650

On 2019/12/24 18:17:25, Wes McKinney <w...@gmail.com> wrote:
> If we go the route of AOT-compilation of Gandiva kernels as an
> approach to generate a shared library with many kernels, we might
> indeed look at possibly generating a "fat" binary with runtime
> dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD
> altogether) kernels. This is something we could do during the code
> generation step where we generate the "stubs" to invoke the IR
> kernels.
>
> Given where the project is at in its development trajectory, it seems
> important to come up with some concrete answers to some of these
> questions to reduce developer anxiety that may otherwise prevent
> forward progress in feature development.
>
> On Tue, Dec 24, 2019 at 2:37 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > I would lean against adding another library dependency. My main concerns
> > with adding another library dependency are:
> > 1. Supporting it across all of the build tool-chains (using a GCC-specific
> > option would be my least favorite approach).
> > 2. Distributed binary size (for wheels, at least, people seem to care).
> >
> > I would lean more towards yes if there were some real-world benchmarks
> > showing a substantial performance gain.
> >
> > I don't think it is unreasonable to package our binaries targeting a common
> > instruction set (e.g. AVX 1 or 2). For those that want to make full use of
> > their latest hardware, compiling from source doesn't seem unreasonable,
> > especially given the recent effort to trim dependencies.
> >
> > Cheers,
> > Micah
> >
> > On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > > Hi,
> > >
> > > I would recommend against reinventing the wheel. It would be possible
> > > to reuse an existing C++ SIMD library. There are several of them (Vc,
> > > xsimd, libsimdpp...). Of course, "just use Gandiva" is another possible
> > > answer.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > Le 20/12/2019 à 08:32, Yibo Cai a écrit :
> > > > Hi,
> > > >
> > > > I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
> > > >
> > > > A typical case is the sum kernel[1]. The tight loop below can easily be
> > > > optimized with SIMD.
> > > >
> > > > for (int64_t i = 0; i < length; i++) {
> > > >   local.sum += values[i];
> > > > }
> > > >
> > > > The compiler already does loop vectorization, but it's done at compile
> > > > time without knowledge of the target CPU.
> > > > Binaries compiled with avx-512 cannot run on old CPUs, while binaries
> > > > compiled with only sse4 enabled are suboptimal on new hardware.
> > > >
> > > > I have some proposals and would like to hear comments from the community.
> > > >
> > > > - Based on our experience with the ISA-L[2] project (an optimized storage
> > > > acceleration library for x86 and Arm), a runtime dispatcher is a good
> > > > approach. Basically, it links in code optimized for different CPU
> > > > features (sse4, avx2, neon, ...) and selects the one that best fits the
> > > > target CPU at first invocation. This is similar to gcc indirect
> > > > functions[3], but doesn't depend on the compiler.
> > > >
> > > > - Use gcc FMV[4] to generate multiple binaries for one function. See the
> > > > sample source and compiled code[5].
> > > > Though it looks simple, it has many limitations: it's a gcc-specific
> > > > feature, with no support from clang or msvc, and it only works on x86,
> > > > with no Arm support. I think this approach is a no-go.
> > > >
> > > > - Don't do it.
> > > > Gandiva leverages LLVM JIT for runtime code optimization. Is it
> > > > duplicated effort to do it in the C++ kernels? Will these vectorizable
> > > > computations move to Gandiva in the future?
> > > >
> > > > [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> > > > [2] https://github.com/intel/isa-l
> > > > [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> > > > [4] https://lwn.net/Articles/691932/
> > > > [5] https://godbolt.org/z/ajpuq_
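
To make the FMV option from the quoted thread concrete: a minimal sketch
using gcc's target_clones attribute (my illustration, not code from the
thread; gcc-specific and x86-only, per the limitations noted above):

  #include <cstdint>

  // gcc emits one clone of the function per listed target, plus an
  // ifunc resolver that picks the best clone when the binary loads.
  __attribute__((target_clones("avx2", "sse4.2", "default")))
  int64_t Sum(const int64_t* values, int64_t length) {
    int64_t sum = 0;
    for (int64_t i = 0; i < length; ++i) sum += values[i];
    return sum;
  }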
