It seems the code for the naive Scalar example doesn't play well with the compiler's auto-vectorizer. If you accumulate into a local variable (instead of through the SumState pointer), you get noticeably different benchmark results, at least with clang++ 6.0.
benchmark-noavx (only SSE):

BM_SumInt32Scalar                  42390 us  42389 us  16  bytes_per_second=8.78824G/s  items_per_second=2.35908G/s
BM_SumInt32ScalarRegister          28036 us  28035 us  25  bytes_per_second=13.2878G/s  items_per_second=3.56691G/s
BM_SumInt32ScalarUnrolled          31660 us  31659 us  22  bytes_per_second=11.7668G/s  items_per_second=3.15863G/s
BM_SumInt32ScalarSentinel          49541 us  49540 us  14  bytes_per_second=7.51971G/s  items_per_second=2.01856G/s
BM_SumInt32ScalarSentinelRegister  55655 us  55654 us  12  bytes_per_second=6.69369G/s  items_per_second=1.79683G/s
BM_SumInt64Scalar                  65030 us  65030 us  11  bytes_per_second=11.4572G/s  items_per_second=1.53776G/s
BM_SumInt64ScalarRegister          55370 us  55369 us  13  bytes_per_second=13.4563G/s  items_per_second=1.80607G/s
BM_SumInt64ScalarUnrolled          61287 us  61286 us  11  bytes_per_second=12.1571G/s  items_per_second=1.6317G/s
BM_SumInt64ScalarSentinel          69792 us  69792 us  10  bytes_per_second=10.6755G/s  items_per_second=1.43284G/s
BM_SumInt64ScalarSentinelRegister  68418 us  68416 us  10  bytes_per_second=10.89G/s    items_per_second=1.46164G/s

benchmark-native (AVX512):

BM_SumInt32Scalar                  43981 us  43980 us  16  bytes_per_second=8.47046G/s  items_per_second=2.27377G/s
BM_SumInt32ScalarRegister          28381 us  28380 us  24  bytes_per_second=13.1263G/s  items_per_second=3.52355G/s
BM_SumInt32ScalarUnrolled          79035 us  79033 us   9  bytes_per_second=4.71361G/s  items_per_second=1.2653G/s
BM_SumInt32ScalarSentinel          53430 us  53429 us  13  bytes_per_second=6.97235G/s  items_per_second=1.87163G/s
BM_SumInt32ScalarSentinelRegister  32450 us  32450 us  22  bytes_per_second=11.4802G/s  items_per_second=3.0817G/s
BM_SumInt64Scalar                  68112 us  68111 us  10  bytes_per_second=10.9389G/s  items_per_second=1.46819G/s
BM_SumInt64ScalarRegister          56188 us  56187 us  12  bytes_per_second=13.2603G/s  items_per_second=1.77977G/s
BM_SumInt64ScalarUnrolled          65649 us  65648 us  11  bytes_per_second=11.3492G/s  items_per_second=1.52327G/s
BM_SumInt64ScalarSentinel          69674 us  69673 us  10  bytes_per_second=10.6937G/s  items_per_second=1.43528G/s
BM_SumInt64ScalarSentinelRegister  59200 us  59199 us  12  bytes_per_second=12.5857G/s  items_per_second=1.68922G/s

The interesting part here is that, on AVX512, the ScalarSentinelRegister version is very close to the fully vectorized ScalarRegister version.

Antoine, the compiler is doing exactly what you're suggesting. Annotated disassembly (sample counts on the left):

374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
    │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
    │ vpcmpneqq %ymm13,%ymm8,%k2
 25 │ vpcmpneqq %ymm13,%ymm9,%k3
 20 │ vpcmpneqq %ymm13,%ymm10,%k4
 72 │ vpcmpneqq %ymm13,%ymm11,%k1
    │ if (Traits<T>::NotNull(*values)) {
    │ vpbroadcastq %rbx,%ymm12{%k2}{z}
 15 │ vpaddq %ymm12,%ymm3,%ymm3
  9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
 57 │ vpaddq %ymm12,%ymm5,%ymm5
    │ vpbroadcastq %rbx,%ymm12{%k4}{z}
 39 │ vpaddq %ymm12,%ymm6,%ymm6
 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
 88 │ vpaddq %ymm12,%ymm7,%ymm7

See my fork: https://github.com/fsaintjacques/bitmaps-vs-sentinels . I didn't have time to look at the other Sum (bitmap) implementation, nor to clean up the code. I'll try to improve this over the weekend.

On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes

--
Sent from my jetpack.