I looked into this a little more, and it appears GCC 7.5 is much more dependent on the unrolling parameter [1].
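For context, the kind of loop in question keeps several independent partial sums so the compiler can hold them in registers. This is a minimal hypothetical sketch (the function name and the four-way unroll factor are illustrative, not the Arrow kernel or its actual unroll width):

```cpp
#include <cstddef>
#include <cstdint>

// Sum with four independent accumulators. Splitting the dependency chain
// lets the compiler keep the partial sums in registers and vectorize;
// with a much larger unroll factor, GCC 7.5 may instead spill partial
// results back to memory, as observed in the godbolt link above.
int64_t SumUnrolled(const int64_t* values, size_t n) {
  int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc0 += values[i];
    acc1 += values[i + 1];
    acc2 += values[i + 2];
    acc3 += values[i + 3];
  }
  int64_t total = acc0 + acc1 + acc2 + acc3;
  for (; i < n; ++i) total += values[i];  // scalar tail
  return total;
}
```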
When looking at int64_t, the assembly changes dramatically for GCC 7.5 between 16-value unrolling and 32-value unrolling. If I read the assembly correctly, the former keeps everything in registers, while the latter tries to write results back to "sum_rounded" as part of the unrolling. Clang consistently keeps everything in registers (GCC 9 appears to generate code similar to Clang 8). This might be an argument for using a SIMD library, even for simpler code, so that we are not so reliant on magic numbers.

[1] https://godbolt.org/z/DZamCk

On Sat, Jun 20, 2020 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I just wanted to follow up on this. I played around with the existing
> master branch to do two things [1]:
> 1. Inform alignment of the vector (the code can't be merged as is because
> it doesn't handle unaligned vectors and needs a macro to handle both MSVC
> and GCC/Clang differences in defining attributes).
> 2. Adjust the size of batched operations to be 256/sizeof(T).
>
> I did both at once, but tweaked the batch size more. Alignment might not
> be necessary for the results below (in which case making a PR would be
> much simpler).
>
> Using these results, I compared against the PR that started this
> discussion [2].
>
> *TL;DR (detailed results posted below):*
> 1. The version provided for SSE4.2 is faster without intrinsics (I had to
> hand-disable runtime dispatch in the intrinsic PR by commenting out bits
> of the CMakefile to make sure AVX2 and AVX512 were not used). I didn't
> look closely at the PR to see if anything was specialized for SSE4.2.
> 2. The intrinsic version is generally better (from 5% to 27%) on GCC 7.5
> using AVX2 (in the past I've found GCC 7.5 doesn't auto-vectorize as well
> as GCC 8 or Clang 8).
> 3. Clang 8 has equivalent performance except for the Int8 case, where the
> intrinsic version is 29% faster.
>
> *Conclusions:*
> 1. I think we should avoid intrinsics for the PR in question (or at least
> split them into a separate PR so that we can focus on runtime dispatch).
> I don't think the int8 use case warrants the complexity of intrinsics
> (but we should potentially investigate other methods to speed it up if we
> do think it is important). I'm happy to reassess if people think it is
> important.
> 2. We should define which compiler to standardize on for reporting
> performance benchmarks. Probably this should relate to either our wheel
> or conda toolchains.
>
> Thanks,
> Micah
>
> Detailed performance deltas. This is run on a GCE instance reporting
> (model name: Intel(R) Xeon(R) CPU @ 2.20GHz).
>
> *GCC 7.5 (SSE4.2) (% reduction in runtime with intrinsics; + indicates
> intrinsics is worse)*
>
> SumKernelFloat/32768/0    +0.1950  +0.1950
> SumKernelDouble/32768/0   +0.1946  +0.1946
> SumKernelInt8/32768/0     -0.0037  -0.0037
> SumKernelInt16/32768/0    +0.2803  +0.2804
> SumKernelInt32/32768/0    +0.6422  +0.6422
> SumKernelInt64/32768/0    +0.2221  +0.2221
>
> *GCC 7.5 (AVX2) (reduction in runtime for intrinsics)*
>
> SumKernelFloat/32768/0    -0.1222  -0.1222
> SumKernelDouble/32768/0   -0.2736  -0.2736
> SumKernelInt8/32768/0     -0.2745  -0.2745
> SumKernelInt16/32768/0    -0.1047  -0.1047
> SumKernelInt32/32768/0    -0.0518  -0.0518
> SumKernelInt64/32768/0    -0.2715  -0.2715
>
> *Clang (AVX2)*
>
> SumKernelFloat/32768/0    +0.0134  +0.0134
> SumKernelDouble/32768/0   +0.0186  +0.0186
> SumKernelInt8/32768/0     -0.2883  -0.2883
> SumKernelInt16/32768/0    -0.0243  -0.0243
> SumKernelInt32/32768/0    +0.0120  +0.0120
> SumKernelInt64/32768/0    -0.0023  -0.0022
>
> [1] https://github.com/emkornfield/arrow/commit/f8755a046055c0a24736793fac02e1e428854dd8
> [2] https://github.com/apache/arrow/pull/7314/
>
> On Fri, Jun 12, 2020 at 4:45 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> On 12/06/2020 at 08:30, Micah Kornfield wrote:
>> > Hi Frank,
>> > Are the performance numbers you published for the baseline directly
>> > from master? I'd like to look at this over the next few days to see
>> > if I can figure out what is going on.
>> >
>> > To all:
>> > I'd like to make sure we flesh out things to consider in general, for
>> > a path forward.
>> >
>> > My take on this is we should still prefer writing code in this order:
>> > 1. Plain-old C++
>>
>> I don't know if that is included morally in your 1), but there's
>> 1.5) C++ with hand-unrolled loops
>>
>> > 3. Native CPU intrinsics. We should develop a rubric for when to
>> > accept PRs for this. This should include:
>> > 1. Performance gain.
>> > 2. General popularity of the architecture.
>>
>> This should also include 3) criticality of the optimization. For
>> example, improving operations on null bitmaps sounds more critical for
>> Arrow than optimizing e.g. the "ascii_lower" kernel.
>>
>> > For dynamic dispatch:
>> > I think we should probably continue down the path of building our own.
>>
>> Agreed. There seem to be two possible approaches with this:
>> 1) a hardwired list of tests for CPU flags
>> 2) a lazily-cached function pointer
>>
>> Regards
>>
>> Antoine.
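The two dispatch approaches Antoine mentions can be combined: probe the CPU flags once and cache the chosen kernel in a function pointer. Below is a minimal hypothetical sketch, not Arrow's actual dispatch machinery; the names (Sum, SumScalar, SumAvx2) are illustrative, and the "AVX2" path is a plain stand-in rather than a real intrinsic kernel:

```cpp
#include <cstddef>
#include <cstdint>

namespace {

// Portable baseline kernel.
int64_t SumScalar(const int64_t* v, size_t n) {
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) total += v[i];
  return total;
}

// Stand-in for an AVX2-specialized kernel; a real version would use
// intrinsics compiled with the appropriate target flags.
int64_t SumAvx2(const int64_t* v, size_t n) { return SumScalar(v, n); }

using SumFn = int64_t (*)(const int64_t*, size_t);

// Approach 1: a hardwired list of CPU-flag tests, evaluated once.
SumFn ChooseSumImpl() {
#if defined(__GNUC__)
  // GCC/Clang built-in; runtime check for AVX2 support.
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

}  // namespace

// Approach 2: a lazily-cached function pointer. The static initializer
// runs on first call only and is thread-safe under C++11 semantics.
int64_t Sum(const int64_t* v, size_t n) {
  static const SumFn impl = ChooseSumImpl();
  return impl(v, n);
}
```

One design note: caching in a function-local static trades one extra indirect call per invocation for zero startup cost, which matters if most kernels are never called in a given process.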