It seems the code in the naive Scalar example does not play well with the
compiler's auto-vectorizer. If you accumulate in a local variable
(instead of through the SumState pointer), you get very different
results, at least with clang++ 6.0.
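For reference, the difference is roughly the following (a minimal sketch,
not the exact code from the repo; SumState and the function names here are
simplified stand-ins):

```cpp
#include <cstddef>
#include <cstdint>

struct SumState {
  int64_t total = 0;
};

// Accumulating through the state pointer: the compiler must assume the
// store through `state` may alias `values`, so the running sum stays in
// memory and the loop tends not to vectorize.
void SumPointer(const int32_t* values, size_t n, SumState* state) {
  for (size_t i = 0; i < n; ++i) {
    state->total += values[i];
  }
}

// Accumulating in a local: the sum lives in a register, the loop body has
// no escaping stores, and the auto-vectorizer handles it cleanly.
void SumRegister(const int32_t* values, size_t n, SumState* state) {
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) {
    total += values[i];
  }
  state->total += total;
}
```

Both compute the same result; only the codegen differs.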

benchmark-noavx (only SSE):
BM_SumInt32Scalar                      42390 us      42389 us         16
bytes_per_second=8.78824G/s items_per_second=2.35908G/s
BM_SumInt32ScalarRegister              28036 us      28035 us         25
bytes_per_second=13.2878G/s items_per_second=3.56691G/s
BM_SumInt32ScalarUnrolled              31660 us      31659 us         22
bytes_per_second=11.7668G/s items_per_second=3.15863G/s
BM_SumInt32ScalarSentinel              49541 us      49540 us         14
bytes_per_second=7.51971G/s items_per_second=2.01856G/s
BM_SumInt32ScalarSentinelRegister      55655 us      55654 us         12
bytes_per_second=6.69369G/s items_per_second=1.79683G/s
BM_SumInt64Scalar                      65030 us      65030 us         11
bytes_per_second=11.4572G/s items_per_second=1.53776G/s
BM_SumInt64ScalarRegister              55370 us      55369 us         13
bytes_per_second=13.4563G/s items_per_second=1.80607G/s
BM_SumInt64ScalarUnrolled              61287 us      61286 us         11
bytes_per_second=12.1571G/s items_per_second=1.6317G/s
BM_SumInt64ScalarSentinel              69792 us      69792 us         10
bytes_per_second=10.6755G/s items_per_second=1.43284G/s
BM_SumInt64ScalarSentinelRegister      68418 us      68416 us         10
bytes_per_second=10.89G/s items_per_second=1.46164G/s

benchmark-native (AVX512):
BM_SumInt32Scalar                      43981 us      43980 us         16
bytes_per_second=8.47046G/s items_per_second=2.27377G/s
BM_SumInt32ScalarRegister              28381 us      28380 us         24
bytes_per_second=13.1263G/s items_per_second=3.52355G/s
BM_SumInt32ScalarUnrolled              79035 us      79033 us          9
bytes_per_second=4.71361G/s items_per_second=1.2653G/s
BM_SumInt32ScalarSentinel              53430 us      53429 us         13
bytes_per_second=6.97235G/s items_per_second=1.87163G/s
BM_SumInt32ScalarSentinelRegister      32450 us      32450 us         22
bytes_per_second=11.4802G/s items_per_second=3.0817G/s
BM_SumInt64Scalar                      68112 us      68111 us         10
bytes_per_second=10.9389G/s items_per_second=1.46819G/s
BM_SumInt64ScalarRegister              56188 us      56187 us         12
bytes_per_second=13.2603G/s items_per_second=1.77977G/s
BM_SumInt64ScalarUnrolled              65649 us      65648 us         11
bytes_per_second=11.3492G/s items_per_second=1.52327G/s
BM_SumInt64ScalarSentinel              69674 us      69673 us         10
bytes_per_second=10.6937G/s items_per_second=1.43528G/s
BM_SumInt64ScalarSentinelRegister      59200 us      59199 us         12
bytes_per_second=12.5857G/s items_per_second=1.68922G/s

The interesting part here is that the ScalarSentinelRegister version is
very close to the fully vectorized ScalarRegister version on AVX512.
Antoine, the compiler is doing exactly what you're suggesting:

374 │       vmovdqu 0x20(%rcx,%rdi,8),%ymm9
 48 │       vmovdqu 0x40(%rcx,%rdi,8),%ymm10
230 │       vmovdqu 0x60(%rcx,%rdi,8),%ymm11
    │       static inline bool NotNull(int64_t val) { return val != null_sentinel; }
    │       vpcmpneqq %ymm13,%ymm8,%k2
 25 │       vpcmpneqq %ymm13,%ymm9,%k3
 20 │       vpcmpneqq %ymm13,%ymm10,%k4
 72 │       vpcmpneqq %ymm13,%ymm11,%k1
    │           if (Traits<T>::NotNull(*values)) {
    │       vpbroadcastq %rbx,%ymm12{%k2}{z}
 15 │       vpaddq %ymm12,%ymm3,%ymm3
  9 │       vpbroadcastq %rbx,%ymm12{%k3}{z}
 57 │       vpaddq %ymm12,%ymm5,%ymm5
    │       vpbroadcastq %rbx,%ymm12{%k4}{z}
 39 │       vpaddq %ymm12,%ymm6,%ymm6
 22 │       vpbroadcastq %rbx,%ymm12{%k1}{z}
 88 │       vpaddq %ymm12,%ymm7,%ymm7
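The scalar source that clang is turning into that masked compare-and-add
sequence looks roughly like this (a sketch; the actual code in the repo
goes through a Traits<T> layer, and the sentinel value here is my
assumption):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumed sentinel; the real code may use a different value.
constexpr int64_t null_sentinel = std::numeric_limits<int64_t>::min();

static inline bool NotNull(int64_t val) { return val != null_sentinel; }

// With the accumulator in a local, clang can unroll this 4x and turn the
// branch into vpcmpneqq (build a mask of non-null lanes) plus a masked
// vpaddq, as in the perf annotation above.
int64_t SumSentinelRegister(const int64_t* values, size_t n) {
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) {
    if (NotNull(values[i])) {
      total += values[i];
    }
  }
  return total;
}
```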

See my fork: https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
didn't have time to look at the other Sum (bitmap) implementation, nor
to clean up the code. I'll try to improve this over the weekend.

On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes
>


-- 
Sent from my jetpack.
