Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-20 Thread Antoine Pitrou
On Fri, 20 Mar 2020 10:56:51 +0800
Yibo Cai wrote:
> I'm revisiting this old thread as I see some avx512 code merged recently [1].
> Code maintenance will be non-trivial if we want to cover more
> hardware (sse/avx/avx512/neon/sve/...) and optimize more code in the future.
> #ifdef is obviously a no-go.
> 
> So I'm selling my proposal again :)
> - put all machine-dependent code in one place (similar to how Linux manages
> various cpu arches)
> - add a runtime dispatcher to select the best SIMD code snippet for the
> running hardware
> 
> I can provide a PR for community review first. Thoughts?

I would separate the two concerns.  The effects of a runtime dispatcher
can be negative for short runs (e.g. when decoding RLE-encoded Parquet
data).
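
To make the concern concrete, here is a minimal hypothetical sketch (not Arrow
code; all names are made up): if the dispatched function is called once per RLE
run, the indirect call and the lost inlining are paid on every run, which
dominates when runs are only a few values long.

    #include <cstdint>

    // Variants selected at startup, e.g. from cpuid; illustrative only.
    void unpack_scalar(const uint32_t* in, uint32_t* out, int64_t n);
    void unpack_avx2(const uint32_t* in, uint32_t* out, int64_t n);

    using UnpackFn = void (*)(const uint32_t*, uint32_t*, int64_t);
    extern UnpackFn unpack_dispatch;  // resolved once, but still an indirect call

    void DecodeRuns(const uint32_t* in, uint32_t* out,
                    const int64_t* run_len, int n_runs) {
      for (int i = 0; i < n_runs; ++i) {
        // For a run of, say, 4 values, the call overhead and the inlining the
        // compiler can no longer do may cost more than the SIMD body saves.
        unpack_dispatch(in, out, run_len[i]);
        in += run_len[i];
        out += run_len[i];
      }
    }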

Regards

Antoine.




Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-19 Thread Yibo Cai

Thanks Wes for the quick response.
Yes, inlining can be a problem for a runtime dispatcher. It means we should
dispatch the whole loop [1], not just the code inside the loop [2]. This may
create traps for developers.

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h#L3760
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h#L40
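
In other words, a hedged sketch with hypothetical names (this is not the actual
bpacking.h code): the function pointer has to wrap the whole loop, so the
per-element helpers can still be inlined inside each variant.

    #include <cstdint>

    namespace scalar {
    // Per-element helper: must stay inlinable, so it cannot sit behind the
    // dispatcher itself.
    inline uint32_t extract_bit(uint32_t word, int bit) { return (word >> bit) & 1u; }

    void unpack1_32(const uint32_t* in, uint32_t* out) {
      for (int i = 0; i < 32; ++i) out[i] = extract_bit(in[0], i);
    }
    }  // namespace scalar

    namespace avx2 {
    void unpack1_32(const uint32_t* in, uint32_t* out);  // SIMD variant, own TU
    }

    // Dispatch at whole-loop granularity: one indirect call per 32 values,
    // not one per value. Putting the pointer around extract_bit instead would
    // defeat inlining and add an indirect call per element (the trap above).
    using Unpack32Fn = void (*)(const uint32_t*, uint32_t*);
    Unpack32Fn ResolveUnpack32();  // picks scalar:: or avx2:: based on the cpu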


Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-19 Thread Wes McKinney
hi Yibo,

I agree with this, having #ifdef in many places in the codebase is not
maintainable longer-term.

As far as runtime dispatch, we could populate a function table of all
machine-dependent functions once, so that the dispatch isn't happening
on each function call, or some similar strategy.
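
A rough sketch of that strategy, assuming a cpuid-style feature query (the
names are illustrative, not an existing Arrow API):

    #include <cstdint>

    // Per-arch variants, each compiled in its own translation unit with the
    // matching compiler flags (-mavx2, etc.); declarations only here.
    int64_t sum_int64_scalar(const int64_t*, int64_t);
    int64_t sum_int64_avx2(const int64_t*, int64_t);

    bool CpuHasAvx2();  // e.g. wraps cpuid on x86, getauxval on Linux/Arm

    struct KernelTable {
      int64_t (*sum_int64)(const int64_t*, int64_t);
      // ... one slot per machine-dependent function
    };

    const KernelTable& GetKernelTable() {
      // Populated exactly once (thread-safe since C++11); afterwards every
      // lookup is a plain load, so no per-call feature detection.
      static const KernelTable table = [] {
        KernelTable t;
        t.sum_int64 = CpuHasAvx2() ? sum_int64_avx2 : sum_int64_scalar;
        return t;
      }();
      return table;
    }

    // usage: int64_t s = GetKernelTable().sum_int64(values, length);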

This of course presumes that functions with runtime SIMD dispatch do
not need to be inlined. For functions that need to be inlined, a
different approach may be required.

- Wes


Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-19 Thread Yibo Cai

I'm revisiting this old thread as I see some avx512 code merged recently [1].
Code maintenance will be non-trivial if we want to cover more
hardware (sse/avx/avx512/neon/sve/...) and optimize more code in the future.
#ifdef is obviously a no-go.

So I'm selling my proposal again :)
- put all machine-dependent code in one place (similar to how Linux manages
various cpu arches)
- add a runtime dispatcher to select the best SIMD code snippet for the
running hardware

I can provide a PR for community review first. Thoughts?

[1] https://github.com/apache/arrow/pull/6650
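
One way the "single place" proposal could look, purely as an illustration (the
layout below is hypothetical; the actual structure would be decided in the PR):

    cpp/src/arrow/util/simd/           (hypothetical)
      dispatch.h       - feature query + function-pointer selection helpers
      sum_scalar.cc    - portable fallback, no special flags
      sum_avx2.cc      - only this file is built with -mavx2
      sum_avx512.cc    - only this file is built with -mavx512f
      sum_neon.cc      - built only on aarch64

This mirrors how the Linux kernel keeps arch-specific code under arch/, with
generic code calling through a narrow interface.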




Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2019-12-24 Thread Wes McKinney
If we go the route of AOT-compilation of Gandiva kernels as an
approach to generate a shared library with many kernels, we might
indeed look at generating a "fat" binary with runtime
dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD
altogether) kernels. This is something we could do during the code
generation step where we generate the "stubs" to invoke the IR
kernels.
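
A hedged sketch of what such a generated stub could look like (hypothetical
names; the real codegen step would emit something equivalent):

    #include <cstdint>

    // Two builds of the same IR kernel linked into one "fat" library.
    extern "C" void kernel_add_avx2(const double*, const double*, double*, int64_t);
    extern "C" void kernel_add_sse42(const double*, const double*, double*, int64_t);

    bool CpuHasAvx2();

    // The generated stub: resolves once, then forwards every call.
    extern "C" void kernel_add(const double* a, const double* b,
                               double* out, int64_t n) {
      using Fn = void (*)(const double*, const double*, double*, int64_t);
      static const Fn impl = CpuHasAvx2() ? kernel_add_avx2 : kernel_add_sse42;
      impl(a, b, out, n);
    }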

Given where the project is at in its development trajectory, it seems
important to come up with some concrete answers to some of these
questions to reduce developer anxiety that may otherwise prevent
forward progress in feature development.

On Tue, Dec 24, 2019 at 2:37 AM Micah Kornfield wrote:
>
> I would lean against adding another library dependency.  My main concerns
> with adding another library dependency are:
> 1.  Supporting it across all of the build tool-chains (using a GCC-specific
> option would be my least favorite approach).
> 2.  Distributed binary size (for wheels, at least, people seem to care).
>
> I would lean more towards yes if there were some real-world benchmarks
> showing a substantial performance gain.
>
> I don't think it is unreasonable to package our binaries targeting a common
> instruction set (e.g. AVX 1 or 2).  For those that want to make full use of
> their latest hardware, compiling from source doesn't seem unreasonable,
> especially given the recent effort to trim dependencies.
>
> Cheers,
> Micah


Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2019-12-20 Thread Antoine Pitrou


Hi,

I would recommend against reinventing the wheel.  It would be possible
to reuse an existing C++ SIMD library.  There are several of them (Vc,
xsimd, libsimdpp...).  Of course, "just use Gandiva" is another possible
answer.
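
For example, the sum loop from the original message might look roughly like
this on top of xsimd (the API sketched here is approximate and varies across
xsimd versions):

    #include <cstddef>
    #include <cstdint>
    #include <xsimd/xsimd.hpp>

    int64_t Sum(const int64_t* values, std::size_t length) {
      using batch = xsimd::batch<int64_t>;          // width fixed at compile time
      const std::size_t step = batch::size;
      const std::size_t vec_end = length - length % step;
      batch acc(static_cast<int64_t>(0));
      for (std::size_t i = 0; i < vec_end; i += step)
        acc += batch::load_unaligned(values + i);   // vertical SIMD adds
      int64_t total = xsimd::reduce_add(acc);       // horizontal reduction
      for (std::size_t i = vec_end; i < length; ++i)
        total += values[i];                          // scalar tail
      return total;
    }

Note that such a library abstracts the instruction set but still fixes the SIMD
width at compile time, so runtime dispatch remains a separate question.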

Regards

Antoine.


On 20/12/2019 at 08:32, Yibo Cai wrote:
> Hi,
> 
> I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
> 
> A typical case is the sum kernel [1]. The tight loop below can be easily
> optimized with SIMD.
> 
> for (int64_t i = 0; i < length; i++) {
>    local.sum += values[i];
> }
> 
> The compiler already does loop vectorization, but it's done at compile time
> without knowledge of the target CPU.
> Binaries compiled with avx-512 cannot run on old CPUs, while binaries compiled
> with only sse4 enabled are suboptimal on new hardware.
> 
> I have some proposals and would like to hear comments from the community.
> 
> - Based on our experience with the ISA-L [2] project (an optimized storage
> acceleration library for x86 and Arm), a runtime dispatcher is a good approach.
> Basically, it links in code optimized for different cpu features (sse4, avx2,
> neon, ...) and selects the best one for the target CPU at first invocation.
> This is similar to a gcc indirect function [3], but doesn't depend on the
> compiler.
> 
> - Use gcc FMV [4] to generate multiple binaries for one function. See sample
> source and compiled code [5].
>    Though it looks simple, it has many limitations: it's a gcc-specific
> feature with no support from clang or msvc, and it only works on x86, not Arm.
>    I think this approach is a no-go.
> 
> - Don't do it.
>    Gandiva leverages LLVM JIT for runtime code optimization. Is it duplicated
> effort to do it in the C++ kernels? Will these vectorizable computations move
> to Gandiva in the future?
> 
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> [2] https://github.com/intel/isa-l
> [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> [4] https://lwn.net/Articles/691932/
> [5] https://godbolt.org/z/ajpuq_
>
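
A minimal sketch of the "select at first invocation" dispatcher described
above, ifunc-like but compiler-independent (hypothetical names, not ISA-L or
Arrow code; a production version would also make the pointer swap thread-safe,
e.g. with std::atomic):

    #include <cstdint>

    static int64_t sum_scalar(const int64_t* v, int64_t n) {
      int64_t s = 0;
      for (int64_t i = 0; i < n; ++i) s += v[i];
      return s;
    }
    int64_t sum_avx2(const int64_t* v, int64_t n);  // built with -mavx2
    bool CpuHasAvx2();                              // cpuid / getauxval wrapper

    using SumFn = int64_t (*)(const int64_t*, int64_t);

    static int64_t sum_resolve(const int64_t* v, int64_t n);
    static SumFn sum_impl = sum_resolve;  // first call lands in the resolver

    // Probes the cpu once, rewrites the pointer, forwards the call; every
    // subsequent call goes straight to the selected variant.
    static int64_t sum_resolve(const int64_t* v, int64_t n) {
      sum_impl = CpuHasAvx2() ? sum_avx2 : sum_scalar;
      return sum_impl(v, n);
    }

    int64_t Sum(const int64_t* v, int64_t n) { return sum_impl(v, n); }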