I just wanted to follow up on this. I played around with the existing master branch to do two things [1]:

1. Inform the compiler of the vector's alignment (the code can't be merged as is, because it doesn't handle unaligned vectors and needs a macro to handle the MSVC vs. GCC/Clang differences in defining alignment attributes).
2. Adjust the size of batched operations to 256/sizeof(T).
I did these both at once, but tweaked the batch size more. Alignment might not be necessary for the results below (in which case making a PR would be much simpler). Using these results, I compared against the PR that started this discussion [2].

*TL;DR (detailed results posted below):*

1. The SSE4.2 version is faster without intrinsics (I had to hand-disable runtime dispatch in the intrinsics PR by commenting out bits of the CMake file to make sure AVX2 and AVX-512 were not used). I didn't look closely at the PR to see if anything was specialized for SSE4.2.
2. The intrinsics version is generally better (from 5% to 27%) on GCC 7.5 with AVX2 (in the past I've found GCC 7.5 doesn't auto-vectorize as well as GCC 8 or Clang 8).
3. Clang 8 gives equivalent performance, except for the Int8 case, where the intrinsics version is 29% faster.

*Conclusions:*

1. I think we should avoid intrinsics in the PR in question (or at least split them into a separate PR so that we can focus on runtime dispatch). I don't think the Int8 use case warrants the complexity of intrinsics (though we should potentially investigate other methods to speed it up if we do think it is important). I'm happy to re-assess if people think it is important.
2. We should define which compiler to standardize on for reporting performance benchmarks. This should probably match either our wheel or conda toolchains.

Thanks,
Micah

Detailed performance deltas. These were run on a GCE instance reporting (model name: Intel(R) Xeon(R) CPU @ 2.20GHz).

*GCC 7.5 (SSE4.2) (% reduction in runtime for intrinsics;
+ indicates intrinsics is worse)*

SumKernelFloat/32768*/0*    +0.1950  +0.1950
SumKernelDouble/32768*/0*   +0.1946  +0.1946
SumKernelInt8/32768*/0*     -0.0037  -0.0037
SumKernelInt16/32768*/0*    +0.2803  +0.2804
SumKernelInt32/32768*/0*    +0.6422  +0.6422
SumKernelInt64/32768*/0*    +0.2221  +0.2221

*GCC 7.5 (AVX2) (reduction in runtime for intrinsics)*

SumKernelFloat/32768*/0*    -0.1222  -0.1222
SumKernelDouble/32768*/0*   -0.2736  -0.2736
SumKernelInt8/32768*/0*     -0.2745  -0.2745
SumKernelInt16/32768*/0*    -0.1047  -0.1047
SumKernelInt32/32768*/0*    -0.0518  -0.0518
SumKernelInt64/32768*/0*    -0.2715  -0.2715

*Clang 8 (AVX2)*

SumKernelFloat/32768*/0*    +0.0134  +0.0134
SumKernelDouble/32768*/0*   +0.0186  +0.0186
SumKernelInt8/32768*/0*     -0.2883  -0.2883
SumKernelInt16/32768*/0*    -0.0243  -0.0243
SumKernelInt32/32768*/0*    +0.0120  +0.0120
SumKernelInt64/32768*/0*    -0.0023  -0.0022

[1] https://github.com/emkornfield/arrow/commit/f8755a046055c0a24736793fac02e1e428854dd8
[2] https://github.com/apache/arrow/pull/7314/

On Fri, Jun 12, 2020 at 4:45 AM Antoine Pitrou <anto...@python.org> wrote:
>
> On 12/06/2020 at 08:30, Micah Kornfield wrote:
> > Hi Frank,
> > Are the performance numbers you published for the baseline directly from
> > master? I'd like to look at this over the next few days to see if I can
> > figure out what is going on.
> >
> > To all:
> > I'd like to make sure we flesh out things to consider, in general, for a
> > path forward.
> >
> > My take on this is we should still prefer writing code in this order:
> > 1. Plain-old C++
>
> I don't know if that is included morally in your 1), but there's
> 1.5) C++ with hand-unrolled loops
>
> > 3. Native CPU intrinsics. We should develop a rubric for when to accept
> > PRs for this. This should include:
> > 1. Performance gain.
> > 2. General popularity of the architecture.
>
> This should also include 3) criticality of the optimization. For
> example, improving operations on null bitmaps sounds more critical for
> Arrow than optimizing e.g.
> the "ascii_lower" kernel.
>
> > For dynamic dispatch:
> > I think we should probably continue down the path of building our own.
>
> Agreed. There seem to be two possible approaches with this:
> 1) a hardwired list of tests for CPU flags
> 2) a lazily-cached function pointer
>
> Regards
>
> Antoine.