[
https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308410#comment-17308410
]
Yibo Cai commented on ARROW-9842:
---------------------------------
[GenerateBitsUnrolled|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_generate.h#L63]
does the actual work. A hand-written _mm_movemask_epi8 implementation does
improve compare kernel performance significantly for integers (roughly 16% to
65% in the benchmark below). A quick POC patch is attached: [^movemask.patch]
*NOTE:* this patch only improves operations with a trivial generator (integer
compare), which is easily vectorizable. For non-trivial generators (string
compare) it hurts performance, by up to 28% in the benchmark below, and the
GenerateBits benchmark itself also drops.
My gut feeling is that it's better to tweak the C++ code and let the compiler
do the dirty work.
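For readers unfamiliar with the intrinsic, here is a minimal sketch of the kind of packing step _mm_movemask_epi8 enables. It is not taken from movemask.patch; the helper name Pack16 and the assumption that every input byte is exactly 0 or 1 are mine, and a scalar fallback is included for builds without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: _mm_movemask_epi8 and friends
#endif

// Hypothetical helper (not Arrow API): packs 16 bytes, each 0 or 1,
// into 16 bits, least-significant bit first.
inline uint16_t Pack16(const uint8_t* bytes) {
  assert(bytes != nullptr);
#if defined(__SSE2__)
  __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes));
  // movemask collects the top bit of every byte, so widen 0/1 to 0x00/0xFF.
  v = _mm_cmpgt_epi8(v, _mm_setzero_si128());
  return static_cast<uint16_t>(_mm_movemask_epi8(v));
#else
  // Portable fallback producing the same result for 0/1 inputs.
  uint16_t bits = 0;
  for (int i = 0; i < 16; ++i) {
    bits = static_cast<uint16_t>(bits | ((bytes[i] != 0 ? 1u : 0u) << i));
  }
  return bits;
#endif
}
```

This only pays off when the 16 input bytes are themselves cheap to produce, which matches the integer versus string split in the numbers below.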
{noformat}
$ archery benchmark diff --suite-filter="arrow-compute-scalar-compare-benchmark" --cc=clang-9 --cxx=clang++-9
-----------------------------------------------------------------------------------------------
Non-regressions: (17)
-----------------------------------------------------------------------------------------------
benchmark                             baseline            contender           change %  counters
GreaterArrayArrayInt64/32768/100      1.889G items/sec    3.119G items/sec    65.101    {}
GreaterArrayArrayInt64/32768/0        1.942G items/sec    3.175G items/sec    63.429    {}
GreaterArrayArrayInt64/32768/10000    1.901G items/sec    3.102G items/sec    63.169    {}
GreaterArrayArrayInt64/32768/10       1.916G items/sec    3.126G items/sec    63.147    {}
GreaterArrayArrayInt64/32768/2        1.909G items/sec    3.104G items/sec    62.651    {}
GreaterArrayArrayInt64/32768/1        1.923G items/sec    3.128G items/sec    62.630    {}
GreaterArrayScalarInt64/32768/10      2.715G items/sec    3.206G items/sec    18.079    {}
GreaterArrayScalarInt64/32768/10000   2.700G items/sec    3.176G items/sec    17.631    {}
GreaterArrayScalarInt64/32768/1       2.744G items/sec    3.194G items/sec    16.386    {}
GreaterArrayScalarInt64/32768/100     2.733G items/sec    3.176G items/sec    16.192    {}
GreaterArrayScalarInt64/32768/2       2.746G items/sec    3.184G items/sec    15.952    {}
GreaterArrayScalarInt64/32768/0       2.772G items/sec    3.208G items/sec    15.719    {}
GreaterArrayArrayString/32768/10000   115.583M items/sec  115.174M items/sec  -0.354    {}
GreaterArrayArrayString/32768/0       116.099M items/sec  115.286M items/sec  -0.700    {}
GreaterArrayArrayString/32768/100     115.595M items/sec  113.561M items/sec  -1.760    {}
GreaterArrayScalarString/32768/2      210.101M items/sec  204.424M items/sec  -2.702    {}
GreaterArrayArrayString/32768/10      117.269M items/sec  113.311M items/sec  -3.375    {}
-----------------------------------------------------------------------------------------------
Regressions: (7)
-----------------------------------------------------------------------------------------------
benchmark                             baseline            contender           change %  counters
GreaterArrayArrayString/32768/2       136.027M items/sec  126.051M items/sec  -7.334    {}
GreaterArrayScalarString/32768/0      603.706M items/sec  524.915M items/sec  -13.051   {}
GreaterArrayScalarString/32768/10000  603.305M items/sec  521.166M items/sec  -13.615   {}
GreaterArrayScalarString/32768/100    580.450M items/sec  497.031M items/sec  -14.371   {}
GreaterArrayScalarString/32768/10     437.081M items/sec  368.205M items/sec  -15.758   {}
GreaterArrayScalarString/32768/1      812.923M items/sec  675.911M items/sec  -16.854   {}
GreaterArrayArrayString/32768/1       613.903M items/sec  441.708M items/sec  -28.049   {}
{noformat}
> [C++] Explore alternative strategy for Compare kernel implementation for
> better performance
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-9842
> URL: https://issues.apache.org/jira/browse/ARROW-9842
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 5.0.0
>
> Attachments: movemask.patch
>
>
> The compiler may be able to vectorize comparison operations if the
> bitpacking of results is deferred until the end (or done in chunks). Instead,
> a temporary bytemap can be populated on a chunk-by-chunk basis and then the
> bytemaps can be bitpacked into the output buffer. This may also reduce the
> code size of the compare kernels (which are actually quite large at the
> moment).
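The strategy in the issue description could look roughly like the sketch below. This is an illustration in portable C++ only, not Arrow code: the function name GreaterToBitmap, the chunk size of 1024, and the requirement that the output buffer holds (n + 7) / 8 bytes are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch (not Arrow code): writes bit i = (left[i] > right[i]),
// LSB first. `out` must hold (n + 7) / 8 bytes; trailing bits are zeroed.
inline void GreaterToBitmap(const int64_t* left, const int64_t* right,
                            std::size_t n, uint8_t* out) {
  assert(n == 0 || (left != nullptr && right != nullptr && out != nullptr));
  constexpr std::size_t kChunk = 1024;  // arbitrary; must be a multiple of 8
  uint8_t bytemap[kChunk];
  for (std::size_t start = 0; start < n; start += kChunk) {
    const std::size_t len = std::min(kChunk, n - start);
    // Step 1: fill the temporary bytemap. This loop is branch-free and has
    // no cross-iteration dependency, so the compiler is free to vectorize it.
    for (std::size_t i = 0; i < len; ++i) {
      bytemap[i] = left[start + i] > right[start + i] ? 1 : 0;
    }
    // Zero the padding so the last output byte has clean trailing bits.
    const std::size_t padded = (len + 7) / 8 * 8;
    for (std::size_t i = len; i < padded; ++i) bytemap[i] = 0;
    // Step 2: pack 8 bytemap entries into each output byte.
    uint8_t* dst = out + start / 8;
    for (std::size_t i = 0; i < padded; i += 8) {
      uint8_t packed = 0;
      for (int b = 0; b < 8; ++b) {
        packed = static_cast<uint8_t>(packed | (bytemap[i + b] << b));
      }
      dst[i / 8] = packed;
    }
  }
}
```

Keeping the comparison and the packing in separate loops is the point of the exercise: each loop stays simple enough for the optimizer, and the bytemap stays small enough to remain in cache.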