Thanks Jed. I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE build (-msse4.2).
[Unroll baseline]

  for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
    for (int64_t k = 0; k < kRoundFactor; k++) {
      sum_rounded[k] += values[i + k];
    }
  }

Benchmark                    Time     CPU      Iterations
SumKernelFloat/32768/0       2.91 us  2.90 us  239992  bytes_per_second=10.5063G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0      1.89 us  1.89 us  374470  bytes_per_second=16.1847G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0        11.6 us  11.6 us   60329  bytes_per_second=2.63274G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0       6.98 us  6.98 us  100293  bytes_per_second=4.3737G/s  null_percent=0 size=32.768k
SumKernelInt32/32768/0       3.89 us  3.88 us  180423  bytes_per_second=7.85862G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0       1.86 us  1.85 us  380477  bytes_per_second=16.4536G/s null_percent=0 size=32.768k

[#pragma omp simd reduction(+:sum)]

  #pragma omp simd reduction(+:sum)
  for (int64_t i = 0; i < n; i++) sum += values[i];

Benchmark                    Time     CPU      Iterations
SumKernelFloat/32768/0       2.97 us  2.96 us  235686  bytes_per_second=10.294G/s  null_percent=0 size=32.768k
SumKernelDouble/32768/0      2.97 us  2.97 us  236456  bytes_per_second=10.2875G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0        11.7 us  11.7 us   60006  bytes_per_second=2.61643G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0       5.47 us  5.47 us  127999  bytes_per_second=5.58002G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0       2.42 us  2.41 us  290635  bytes_per_second=12.6485G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0       1.82 us  1.82 us  386749  bytes_per_second=16.7733G/s null_percent=0 size=32.768k

[SSE intrinsic]

Benchmark                    Time     CPU      Iterations
SumKernelFloat/32768/0       2.24 us  2.24 us  310914  bytes_per_second=13.6335G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0      1.43 us  1.43 us  486642  bytes_per_second=21.3266G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0        6.93 us  6.92 us  100720  bytes_per_second=4.41046G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0       3.14 us  3.14 us  222803  bytes_per_second=9.72931G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0       2.11 us  2.11 us  331388  bytes_per_second=14.4907G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0       1.32 us  1.32 us  532964  bytes_per_second=23.0728G/s null_percent=0 size=32.768k

I tried tweaking kRoundFactor, using an unroll-based omp simd variant, and building with clang-8, but unfortunately I could never get the results up to the intrinsic version. The generated ASM all uses SIMD instructions, with only small differences such as the instruction sequences or the xmm registers used. What the compiler does under the hood is still something of a mystery to me. (A stripped-down sketch of the kind of intrinsic kernel I'm comparing against is appended below the quoted message.)

Thanks,
Frank

-----Original Message-----
From: Jed Brown <j...@jedbrown.org>
Sent: Thursday, June 11, 2020 1:58 AM
To: Du, Frank <frank...@intel.com>; dev@arrow.apache.org
Subject: RE: [C++][Discuss] Approaches for SIMD optimizations

"Du, Frank" <frank...@intel.com> writes:

> The PR I committed provide a basic support for runtime dispatching. I
> agree that complier should generate good vectorize for the non-null
> data part but in fact it didn't, jedbrown point to it can force
> complier to SIMD using some additional pragmas, something like
> "#pragma omp simd reduction(+:sum)", I will try this pragma later but
> need figure out if it need a linking against OpenMP.

It does not require linking OpenMP. You just compile with -fopenmp-simd (gcc/clang) or -qopenmp-simd (icc) so that it interprets the "omp simd" pragmas. (These can be captured in macros using _Pragma.)

Note that you get automatic vectorization for this sort of thing without any OpenMP if you add -funsafe-math-optimizations (included in -ffast-math).
https://gcc.godbolt.org/z/8thgru

Many projects don't want -funsafe-math-optimizations because there are places where it can hurt numerical stability. ICC includes unsafe math in normal optimization levels, while GCC and Clang are more conservative.
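Coming back to the _Pragma note above, a minimal sketch of how the pragma could be hidden behind a macro (illustrative only; OPENMP_SIMD_ENABLED is a made-up guard, not an existing Arrow build flag):

  // Sketch only: OPENMP_SIMD_ENABLED is a hypothetical guard macro.
  // Compile with -fopenmp-simd (gcc/clang) or -qopenmp-simd (icc) so the
  // "omp simd" pragma is honored; no OpenMP runtime library is linked.
  #ifdef OPENMP_SIMD_ENABLED
  #  define PRAGMA(tokens) _Pragma(#tokens)
  #  define SIMD_REDUCE(clause) PRAGMA(omp simd reduction(clause))
  #else
  #  define SIMD_REDUCE(clause)
  #endif

  // Usage (same loop as the omp simd case above):
  //   SIMD_REDUCE(+:sum)
  //   for (int64_t i = 0; i < n; i++) sum += values[i];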
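And here is the stripped-down shape of the SSE intrinsic path I mentioned above, just to show what the compiler-generated code is being compared against (float only, unaligned loads, single accumulator, no null handling; this is an illustration, not the actual PR kernel):

  #include <immintrin.h>
  #include <stdint.h>

  // Illustration only, not the PR code: accumulate 4 floats per iteration
  // in one __m128, reduce the 4 lanes, then finish the tail with scalar adds.
  float SumFloatSse(const float* values, int64_t n) {
    __m128 acc = _mm_setzero_ps();
    int64_t i = 0;
    for (; i + 4 <= n; i += 4) {
      acc = _mm_add_ps(acc, _mm_loadu_ps(values + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++) sum += values[i];
    return sum;
  }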