Hi Micah,

Yes, on the latest baseline.

One thing for the SIMD library: can we take into account support for masked 
operations[1]? MIPP[2] already supports them. This would be very useful, as 
most Arrow functions come with a validity bitmap.

[1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers
[2] https://github.com/aff3ct/MIPP#masked-instructions
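
To illustrate, here is a rough sketch of a null-aware sum that expands the 
validity bitmap into a lane mask. It uses plain SSE4.1 intrinsics only for 
illustration (the function name is made up; a wrapper library with mask 
support would hide the bit-expansion step, and AVX-512 opmask registers[1] 
eliminate it entirely):

  #include <smmintrin.h>  // SSE4.1
  #include <cstdint>

  // Sketch only: sum the values whose validity-bitmap bit is set.
  // Invalid lanes are zeroed with a mask before the vector add.
  float MaskedSum(const float* values, const uint8_t* valid_bitmap, int64_t n) {
    __m128 acc = _mm_setzero_ps();
    int64_t i = 0;
    for (; i + 4 <= n; i += 4) {
      __m128 v = _mm_loadu_ps(values + i);
      // Expand 4 bitmap bits into all-ones/all-zeros 32-bit lanes.
      alignas(16) uint32_t m[4];
      for (int k = 0; k < 4; ++k)
        m[k] = ((valid_bitmap[(i + k) >> 3] >> ((i + k) & 7)) & 1) ? 0xFFFFFFFFu : 0u;
      __m128 mask = _mm_load_ps(reinterpret_cast<const float*>(m));
      acc = _mm_add_ps(acc, _mm_and_ps(v, mask));  // invalid lanes add 0
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)  // scalar tail
      if ((valid_bitmap[i >> 3] >> (i & 7)) & 1) sum += values[i];
    return sum;
  }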

Thanks,
Frank

-----Original Message-----
From: Micah Kornfield <emkornfi...@gmail.com> 
Sent: Friday, June 12, 2020 2:31 PM
To: dev <dev@arrow.apache.org>
Subject: Re: [C++][Discuss] Approaches for SIMD optimizations

Hi Frank,
Are the performance numbers you published for the baseline directly from 
master?  I'd like to look at this over the next few days to see if I can figure 
out what is going on.

To all:
I'd like to make sure we flesh out the things to consider in general, for a 
path forward.

My take on this is we should still prefer writing code in this order:
1.  Plain-old C++
2.  SIMD wrapper library (my preference would be towards something that is 
going to be standardized eventually, to limit 3P dependencies; the counter 
argument is that one of the libraries mentioned above may have much better 
feature coverage on advanced instruction sets).  Please chime in if there 
are other things to consider.  We should have a rubric for when to make use 
of the library (i.e. what performance gain we get on a workload); a sketch 
of the standardization-track interface follows this list.
3.  Native CPU intrinsics.  We should develop a rubric for when to accept PRs 
for this.  This should include:
       1.  Performance gain.
       2.  General popularity of the architecture.
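
For reference on option 2, here is a minimal sketch of a sum kernel against 
the standardization-track interface (std::experimental::simd from the 
Parallelism TS v2, available in recent libstdc++; illustrative only, not a 
claim about what we would actually adopt):

  #include <experimental/simd>
  #include <cstdint>
  namespace stdx = std::experimental;

  float SumSimd(const float* values, int64_t n) {
    using V = stdx::native_simd<float>;
    V acc = 0.0f;
    int64_t i = 0;
    const int64_t step = V::size();
    for (; i + step <= n; i += step) {
      V v;
      v.copy_from(values + i, stdx::element_aligned);
      acc += v;  // vertical add across lanes
    }
    float sum = stdx::reduce(acc);  // horizontal reduction
    for (; i < n; ++i) sum += values[i];  // scalar tail
    return sum;
  }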

For dynamic dispatch:
I think we should probably continue down the path of building our own.  I 
looked more at libsimdpp's implementation, and it might be something we can 
use for guidance, but as it stands it doesn't seem to have hooks based on 
CPU manufacturer, which would be a requirement for BMI2 intrinsics.  The 
alternative would be to ban BMI2 intrinsics from the code (this might not be 
a bad idea to limit complexity in general).
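
For concreteness, a hedged sketch of what a vendor-aware hook could look 
like with GCC/Clang built-ins (the helper name is made up; BMI2's pdep/pext 
are the motivating case, since the feature bit alone says nothing about 
whether they are fast on a given manufacturer's parts):

  #include <cpuid.h>
  #include <cstring>

  // Sketch only: gate BMI2 kernels on vendor as well as the feature bit.
  bool UseBmi2Kernels() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return false;
    char vendor[13];
    std::memcpy(vendor + 0, &ebx, 4);  // vendor string order: EBX, EDX, ECX
    std::memcpy(vendor + 4, &edx, 4);
    std::memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return std::strcmp(vendor, "GenuineIntel") == 0 &&
           __builtin_cpu_supports("bmi2");
  }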

Thoughts?

Thanks,
Micah

On Wed, Jun 10, 2020 at 8:35 PM Du, Frank <frank...@intel.com> wrote:

> Thanks Jed.
>
> I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE
> build (-msse4.2).
>
> [Unroll baseline]
>     for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
>       for (int64_t k = 0; k < kRoundFactor; k++) {
>         sum_rounded[k] += values[i + k];
>       }
>     }
> SumKernelFloat/32768/0        2.91 us         2.90 us       239992 bytes_per_second=10.5063G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0       1.89 us         1.89 us       374470 bytes_per_second=16.1847G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0         11.6 us         11.6 us        60329 bytes_per_second=2.63274G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0        6.98 us         6.98 us       100293 bytes_per_second=4.3737G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0        3.89 us         3.88 us       180423 bytes_per_second=7.85862G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0        1.86 us         1.85 us       380477 bytes_per_second=16.4536G/s null_percent=0 size=32.768k
>
> [#pragma omp simd reduction(+:sum)]
> #pragma omp simd reduction(+:sum)
>     for (int64_t i = 0; i < n; i++)
>         sum += values[i];
> SumKernelFloat/32768/0        2.97 us         2.96 us       235686 bytes_per_second=10.294G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0       2.97 us         2.97 us       236456 bytes_per_second=10.2875G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0         11.7 us         11.7 us        60006 bytes_per_second=2.61643G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0        5.47 us         5.47 us       127999 bytes_per_second=5.58002G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0        2.42 us         2.41 us       290635 bytes_per_second=12.6485G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0        1.82 us         1.82 us       386749 bytes_per_second=16.7733G/s null_percent=0 size=32.768k
>
> [SSE intrinsic]
> SumKernelFloat/32768/0        2.24 us         2.24 us       310914 bytes_per_second=13.6335G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0       1.43 us         1.43 us       486642 bytes_per_second=21.3266G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0         6.93 us         6.92 us       100720 bytes_per_second=4.41046G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0        3.14 us         3.14 us       222803 bytes_per_second=9.72931G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0        2.11 us         2.11 us       331388 bytes_per_second=14.4907G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0        1.32 us         1.32 us       532964 bytes_per_second=23.0728G/s null_percent=0 size=32.768k
>
> I tried tweaking kRoundFactor, using some unrolled variants of omp simd, 
> and building with clang-8, but unfortunately I could never get the results 
> up to the intrinsic version.  The generated ASM all uses SIMD instructions, 
> with only small differences such as instruction sequences or which xmm 
> registers are used.  What the compiler does under the hood is still 
> something of a mystery to me.
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Jed Brown <j...@jedbrown.org>
> Sent: Thursday, June 11, 2020 1:58 AM
> To: Du, Frank <frank...@intel.com>; dev@arrow.apache.org
> Subject: RE: [C++][Discuss] Approaches for SIMD optimizations
>
> "Du, Frank" <frank...@intel.com> writes:
>
> > The PR I committed provides basic support for runtime dispatching. 
> > I agree that the compiler should generate good vectorized code for 
> > the non-null data part, but in fact it didn't. Jed pointed out that 
> > the compiler can be forced to emit SIMD using some additional 
> > pragmas, something like "#pragma omp simd reduction(+:sum)". I will 
> > try this pragma later but need to figure out whether it requires 
> > linking against OpenMP.
>
> It does not require linking OpenMP.  You just compile with 
> -fopenmp-simd
> (gcc/clang) or -qopenmp-simd (icc) so that it interprets the "omp simd"
> pragmas.  (These can be captured in macros using _Pragma.)
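>
> A minimal sketch of that capture (the macro and flag names are made up):
>
>   #ifdef USE_OPENMP_SIMD  // hypothetical build option
>   #  define SIMD_PRAGMA(x) _Pragma(#x)
>   #else
>   #  define SIMD_PRAGMA(x)
>   #endif
>
>   double Sum(const double* values, long n) {
>     double sum = 0.0;
>     SIMD_PRAGMA(omp simd reduction(+:sum))
>     for (long i = 0; i < n; ++i) sum += values[i];
>     return sum;
>   }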
>
> Note that you get automatic vectorization for this sort of thing 
> without any OpenMP if you add -funsafe-math-optimizations (included in 
> -ffast-math).
>
>   https://gcc.godbolt.org/z/8thgru
>
> Many projects don't want -funsafe-math-optimizations because there are 
> places where it can hurt numerical stability.  ICC includes unsafe math 
> at normal optimization levels, while GCC and Clang are more conservative.
>
