I just wanted to follow up on this.  I played around with the existing
master branch to do two things [1]:
1.  Inform the compiler of the vector's alignment (the code can't be merged
as is because it doesn't handle unaligned vectors and needs a macro to
paper over the MSVC vs. GCC/Clang differences in defining attributes).
2.  Adjust the size of batched operations to be 256/sizeof(T).

I did these both at once, but tweaked the batch size more. Alignment might
not be necessary for the results below (in which case making a PR would be
much simpler).

Using these changes, I compared against the PR that started this
discussion [2].

*TL;DR (detailed results posted below):*
1.  For SSE4.2, the version without intrinsics is faster (I had to
hand-disable runtime dispatch in the intrinsics PR by commenting out parts
of the CMake file to make sure AVX2 and AVX-512 were not used).  I didn't
look closely at the PR to see whether anything was specialized for SSE4.2.
2.  The intrinsic version is generally better (by 5% to 27%) on GCC 7.5
using AVX2 (in the past I've found GCC 7.5 doesn't auto-vectorize as well
as GCC 8 or Clang 8).
3.  Clang 8 gives equivalent performance except for the Int8 case, where
the intrinsic version is 29% faster.

*Conclusions:*
1.  I think we should avoid intrinsics for the PR in question (or at least
split them into a separate PR so that we can focus on runtime dispatch).  I
don't think the int8 use-case warrants the complexity of intrinsics (but we
should potentially investigate other methods to speed it up if we do think
it is important).  I'm happy to re-assess if people think it is important.
2.  We should decide which compiler to standardize on for reporting
performance benchmarks.  This should probably match either our wheel or
conda toolchain.

Thanks,
Micah

Detailed performance deltas.  These were run on a GCE instance reporting
(model name      : Intel(R) Xeon(R) CPU @ 2.20GHz).


*GCC 7.5 (SSE4.2)                                                 (%
reduction in runtime with intrinsics.  + indicates intrinsics are worse)*

SumKernelFloat/32768*/0*                     +0.1950         +0.1950

SumKernelDouble/32768*/0*                    +0.1946         +0.1946

SumKernelInt8/32768*/0*                      -0.0037         -0.0037

SumKernelInt16/32768*/0*                     +0.2803         +0.2804

SumKernelInt32/32768*/0*                     +0.6422         +0.6422

SumKernelInt64/32768*/0*                     +0.2221         +0.2221



*GCC 7.5 (AVX2)                       (% reduction in runtime with
intrinsics)*

SumKernelFloat/32768*/0*                     -0.1222         -0.1222

SumKernelDouble/32768*/0*                    -0.2736         -0.2736

SumKernelInt8/32768*/0*                      -0.2745         -0.2745

SumKernelInt16/32768*/0*                     -0.1047         -0.1047

SumKernelInt32/32768*/0*                     -0.0518         -0.0518

SumKernelInt64/32768*/0*                     -0.2715         -0.2715




*Clang 8 (AVX2)*

SumKernelFloat/32768*/0*                     +0.0134         +0.0134

SumKernelDouble/32768*/0*                    +0.0186         +0.0186

SumKernelInt8/32768*/0*                      -0.2883         -0.2883

SumKernelInt16/32768*/0*                     -0.0243         -0.0243

SumKernelInt32/32768*/0*                     +0.0120         +0.0120

SumKernelInt64/32768*/0*                     -0.0023         -0.0022


[1]
https://github.com/emkornfield/arrow/commit/f8755a046055c0a24736793fac02e1e428854dd8
[2] https://github.com/apache/arrow/pull/7314/

On Fri, Jun 12, 2020 at 4:45 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 12/06/2020 à 08:30, Micah Kornfield a écrit :
> > Hi Frank,
> > Are the performance numbers you published for the baseline directly from
> > master?  I'd like to look at this over the next few days to see if I can
> > figure out what is going on.
> >
> > To all:
> > I'd like to make sure we flesh out things to consider in general, for a
> > path forward.
> >
> > My take on this is we should still prefer writing code in this order:
> > 1.  Plain-old C++
>
> I don't know if that is included morally in your 1), but there's
> 1.5) C++ with hand-unrolled loops
>
> > 3.  Native CPU intrinsics.  We should develop a rubric for when to accept
> > PRs for this.  This should include:
> >        1.  Performance gain.
> >        2.  General popularity of the architecture.
>
> This should also include 3) criticality of the optimization.  For
> example, improving operations on null bitmaps sounds more critical for
> Arrow than optimizing e.g. the "ascii_lower" kernel.
>
> > For dynamic dispatch:
> > I think we should probably continue down the path of building our own.
>
> Agreed.  There seem to be two possible approaches with this:
> 1) a hardwired list of tests for CPU flags
> 2) a lazily-cached function pointer
>
> Regards
>
> Antoine.
>
