Thanks Jed.
I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE
build (-msse4.2).
[Unroll baseline]
  for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
    for (int64_t k = 0; k < kRoundFactor; k++) {
      sum_rounded[k] += values[i + k];
    }
  }
SumKernelFloat/32768/0 2.91 us 2.90 us 239992
bytes_per_second=10.5063G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0 1.89 us 1.89 us 374470
bytes_per_second=16.1847G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0 11.6 us 11.6 us 60329
bytes_per_second=2.63274G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0 6.98 us 6.98 us 100293
bytes_per_second=4.3737G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0 3.89 us 3.88 us 180423
bytes_per_second=7.85862G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0 1.86 us 1.85 us 380477
bytes_per_second=16.4536G/s null_percent=0 size=32.768k
[#pragma omp simd reduction(+:sum)]
  #pragma omp simd reduction(+:sum)
  for (int64_t i = 0; i < n; i++)
    sum += values[i];
SumKernelFloat/32768/0 2.97 us 2.96 us 235686
bytes_per_second=10.294G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0 2.97 us 2.97 us 236456
bytes_per_second=10.2875G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0 11.7 us 11.7 us 60006
bytes_per_second=2.61643G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0 5.47 us 5.47 us 127999
bytes_per_second=5.58002G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0 2.42 us 2.41 us 290635
bytes_per_second=12.6485G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0 1.82 us 1.82 us 386749
bytes_per_second=16.7733G/s null_percent=0 size=32.768k
[SSE intrinsic]
SumKernelFloat/32768/0 2.24 us 2.24 us 310914
bytes_per_second=13.6335G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0 1.43 us 1.43 us 486642
bytes_per_second=21.3266G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0 6.93 us 6.92 us 100720
bytes_per_second=4.41046G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0 3.14 us 3.14 us 222803
bytes_per_second=9.72931G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0 2.11 us 2.11 us 331388
bytes_per_second=14.4907G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0 1.32 us 1.32 us 532964
bytes_per_second=23.0728G/s null_percent=0 size=32.768k
I tried tweaking kRoundFactor, using some unroll-based omp simd variants,
and building with clang-8, but unfortunately I could never get the results
up to the intrinsic version. The generated ASM all uses SIMD instructions,
with only small differences such as the instruction sequence or which xmm
registers are used. What the compiler does under the hood is still a bit
of a mystery to me.
Thanks,
Frank
-----Original Message-----
From: Jed Brown <[email protected]>
Sent: Thursday, June 11, 2020 1:58 AM
To: Du, Frank <[email protected]>; [email protected]
Subject: RE: [C++][Discuss] Approaches for SIMD optimizations
"Du, Frank" <[email protected]> writes:
> The PR I committed provides basic support for runtime dispatching. I
> agree that the compiler should generate well-vectorized code for the
> non-null data part, but in fact it doesn't. jedbrown pointed out that
> the compiler can be forced to SIMD using some additional pragmas,
> something like "#pragma omp simd reduction(+:sum)". I will try this
> pragma later but need to figure out whether it requires linking
> against OpenMP.
It does not require linking OpenMP. You just compile with -fopenmp-simd
(gcc/clang) or -qopenmp-simd (icc) so that it interprets the "omp simd"
pragmas. (These can be captured in macros using _Pragma.)
Note that you get automatic vectorization for this sort of thing without any
OpenMP if you add -funsafe-math-optimizations (included in -ffast-math).
https://gcc.godbolt.org/z/8thgru
Many projects don't want -funsafe-math-optimizations because there are places
where it can hurt numerical stability. ICC includes unsafe math in normal
optimization levels while GCC and Clang are more conservative.