I looked into this a little more, and it appears GCC 7.5 is much more
sensitive to the unrolling parameter than Clang 8 or GCC 9 [1].

When looking at int64_t, the assembly changes dramatically for GCC 7.5
between 16-value and 32-value unrolling. If I read the assembly correctly,
the former keeps everything in registers, while the latter tries to write
results back to "sum_rounded" as part of the unrolling. Clang consistently
keeps everything in registers (GCC 9 appears to generate code similar to
Clang 8's).

This might be an argument for using a SIMD library, even for simpler code,
so that we are not so reliant on magic numbers.

[1] https://godbolt.org/z/DZamCk

On Sat, Jun 20, 2020 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I just wanted to follow up on this.  I played around with the existing
> master branch to do two things [1]:
> 1.  Inform the compiler of the vector's alignment (the code can't be merged
> as is because it doesn't handle unaligned vectors, and it needs a macro to
> handle the MSVC and GCC/Clang differences in defining attributes).
> 2.  Adjust the size of batched operations to be 256/sizeof(T).
>
> I did these both at once, but tweaked the batch size more.  Alignment might
> not be necessary for the results below (in which case making a PR would be
> much simpler).  A rough sketch of both changes follows.
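>
> (The macro and helper names in this sketch are hypothetical, not what my
> branch actually uses.)
>
>   #include <cstddef>
>
>   // Portable "assume aligned" hint; MSVC has no direct equivalent of
>   // __builtin_assume_aligned, so this sketch just passes through there.
>   #if defined(_MSC_VER)
>   #define ASSUME_ALIGNED(p, a) (p)
>   #else
>   #define ASSUME_ALIGNED(p, a) static_cast<decltype(p)>(__builtin_assume_aligned((p), (a)))
>   #endif
>
>   template <typename T>
>   T SumAligned(const T* values, size_t length) {
>     // (1) Promise 64-byte alignment; unsafe for unaligned input, which is
>     //     why a fallback path is needed before this could be merged.
>     const T* v = ASSUME_ALIGNED(values, 64);
>     // (2) Batch size scaled by element width: 256 / sizeof(T) values.
>     constexpr size_t kBatch = 256 / sizeof(T);
>     T lanes[kBatch] = {};
>     size_t i = 0;
>     for (; i + kBatch <= length; i += kBatch) {
>       for (size_t j = 0; j < kBatch; ++j) lanes[j] += v[i + j];
>     }
>     T sum = 0;
>     for (size_t j = 0; j < kBatch; ++j) sum += lanes[j];
>     for (; i < length; ++i) sum += v[i];
>     return sum;
>   }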
>
> Using these changes, I compared against the PR that started this
> discussion [2].
>
> *TL;DR (detailed results posted below):*
> 1.  The version provided for SSE4.2 is faster without intrinsics (I had to
> hand-disable runtime dispatch in the intrinsics PR by commenting out bits
> of the CMake file to make sure AVX2 and AVX512 were not used).  I didn't
> look closely at the PR to see if anything was specialized for SSE4.2.
> 2.  The intrinsics version is generally better (from 5% to 27%) on GCC 7.5
> using AVX2 (in the past I've found that GCC 7.5 doesn't auto-vectorize as
> well as GCC 8 or Clang 8).
> 3.  Clang 8 gives equivalent performance, except for the Int8 case, where
> the intrinsics version is 29% faster.
>
> *Conclusions:*
> 1.  I think we should avoid intrinsics for the PR in question (or at least
> split them into a separate PR so that we can focus on runtime dispatch).  I
> don't think the int8 use case warrants the complexity of intrinsics, but we
> should investigate other methods of speeding it up if we do think it is
> important.  I'm happy to reassess if people think it matters.
> 2.  We should define which compiler to standardize on for reporting
> performance benchmarks; it should probably match either our wheel or conda
> toolchains.
>
> Thanks,
> Micah
>
> Detailed performance deltas are below.  These were run on a GCE instance
> reporting (model name: Intel(R) Xeon(R) CPU @ 2.20GHz).
>
>
> *GCC 7.5 (SSE4.2)          (% change in runtime with intrinsics; +
> indicates intrinsics are worse)*
>
> SumKernelFloat/32768*/0*                     +0.1950         +0.1950
> SumKernelDouble/32768*/0*                    +0.1946         +0.1946
> SumKernelInt8/32768*/0*                      -0.0037         -0.0037
> SumKernelInt16/32768*/0*                     +0.2803         +0.2804
> SumKernelInt32/32768*/0*                     +0.6422         +0.6422
> SumKernelInt64/32768*/0*                     +0.2221         +0.2221
>
> *GCC 7.5 (AVX2)            (% change in runtime with intrinsics)*
>
> SumKernelFloat/32768*/0*                     -0.1222         -0.1222
> SumKernelDouble/32768*/0*                    -0.2736         -0.2736
> SumKernelInt8/32768*/0*                      -0.2745         -0.2745
> SumKernelInt16/32768*/0*                     -0.1047         -0.1047
> SumKernelInt32/32768*/0*                     -0.0518         -0.0518
> SumKernelInt64/32768*/0*                     -0.2715         -0.2715
>
> *Clang 8 (AVX2)            (% change in runtime with intrinsics)*
>
> SumKernelFloat/32768*/0*                     +0.0134         +0.0134
> SumKernelDouble/32768*/0*                    +0.0186         +0.0186
> SumKernelInt8/32768*/0*                      -0.2883         -0.2883
> SumKernelInt16/32768*/0*                     -0.0243         -0.0243
> SumKernelInt32/32768*/0*                     +0.0120         +0.0120
> SumKernelInt64/32768*/0*                     -0.0023         -0.0022
>
>
> [1]
> https://github.com/emkornfield/arrow/commit/f8755a046055c0a24736793fac02e1e428854dd8
> [2] https://github.com/apache/arrow/pull/7314/
>
> On Fri, Jun 12, 2020 at 4:45 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> On 12/06/2020 at 08:30, Micah Kornfield wrote:
>> > Hi Frank,
>> > Are the performance numbers you published for the baseline directly from
>> > master?  I'd like to look at this over the next few days to see if I can
>> > figure out what is going on.
>> >
>> > To all:
>> > I'd like to make sure we flesh out the things to consider, in general,
>> > for a path forward.
>> >
>> > My take on this is that we should still prefer writing code in this order:
>> > 1.  Plain-old C++
>>
>> I don't know if it's implicitly included in your 1), but there's
>> 1.5) C++ with hand-unrolled loops
>>
>> > 3.  Native CPU intrinsics.  We should develop a rubric for when to
>> > accept PRs for this.  This should include:
>> >        1.  Performance gain.
>> >        2.  General popularity of the architecture.
>>
>> This should also include 3) criticality of the optimization.  For
>> example, improving operations on null bitmaps sounds more critical for
>> Arrow than optimizing e.g. the "ascii_lower" kernel.
>>
>> > For dynamic dispatch:
>> > I think we should probably continue down the path of building our own.
>>
>> Agreed.  There seem to be two possible approaches to this:
>> 1) a hardwired list of tests for CPU flags
>> 2) a lazily-cached function pointer (a rough sketch of this is below)
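>>
>> For example, a minimal sketch of 2) (the names are hypothetical, and
>> __builtin_cpu_supports is GCC/Clang-specific; MSVC would need __cpuid):
>>
>>   #include <cstdint>
>>
>>   using SumFn = int64_t (*)(const int64_t*, int64_t);
>>
>>   static int64_t SumDefault(const int64_t* v, int64_t n) {
>>     int64_t s = 0;
>>     for (int64_t i = 0; i < n; ++i) s += v[i];
>>     return s;
>>   }
>>
>>   // Stand-in: in real code this would live in a translation unit
>>   // compiled with -mavx2 and use AVX2 intrinsics.
>>   static int64_t SumAvx2(const int64_t* v, int64_t n) {
>>     return SumDefault(v, n);  // placeholder body for the sketch
>>   }
>>
>>   static SumFn ResolveSum() {
>>     return __builtin_cpu_supports("avx2") ? SumAvx2 : SumDefault;
>>   }
>>
>>   int64_t Sum(const int64_t* v, int64_t n) {
>>     // The resolver runs once (thread-safe since C++11); later calls go
>>     // through the cached function pointer.
>>     static const SumFn kImpl = ResolveSum();
>>     return kImpl(v, n);
>>   }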
>>
>> Regards
>>
>> Antoine.
>>
>
