[
https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310428#comment-17310428
]
Yibo Cai commented on ARROW-9842:
---------------------------------
Updated POC patch to evaluate best chunk size to accumulate bits before packing.
[^movemask-in-chunks.diff]
To my surprise, big chunks actually hurt performance (tested with
arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec,
while chunk size 256 gives ~2G.
My theory is that with a big chunk size, the CPU has to stall waiting for memory
loads. With a small chunk size, the CPU can interleave memory loads with the
subsequent computation (packing bytes into bits). I see IPC (instructions per
cycle) drop from 2.4 to 2.2 when the chunk size increases from 16 to 64.
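For illustration, a minimal scalar sketch of the strategy being benchmarked: comparisons are written to a small temporary bytemap one chunk at a time, and each chunk is then bitpacked into the output bitmap. The names (CompareLess, PackChunk) and the scalar packing loop are hypothetical, not Arrow's actual kernel code; a real build would use SIMD movemask for the packing step.

```cpp
#include <cstddef>
#include <cstdint>

// Chunk of 16 elements, matching the best-performing size reported above.
constexpr size_t kChunk = 16;

// Pack kChunk bytes (each 0x00 or 0x01) into kChunk/8 output bytes,
// LSB-first as in Arrow's validity/boolean bitmaps. Caller must have
// zero-initialized the output bytes, since we OR bits in.
inline void PackChunk(const uint8_t* bytes, uint8_t* out) {
  for (size_t i = 0; i < kChunk; ++i) {
    out[i / 8] |= static_cast<uint8_t>(bytes[i] & 1) << (i % 8);
  }
}

// Compare left[i] < right[i] over n elements (n assumed to be a multiple of
// kChunk, for brevity) and emit the result as a packed bitmap.
void CompareLess(const int32_t* left, const int32_t* right, size_t n,
                 uint8_t* out_bitmap) {
  uint8_t bytemap[kChunk];
  for (size_t base = 0; base < n; base += kChunk) {
    // Byte-wise compare: a branch-free inner loop the compiler can
    // auto-vectorize, since no bit-level packing happens here.
    for (size_t i = 0; i < kChunk; ++i) {
      bytemap[i] = left[base + i] < right[base + i] ? 1 : 0;
    }
    // Bitpack this chunk while the next chunk's loads can already issue.
    PackChunk(bytemap, out_bitmap + base / 8);
  }
}
```

Because each chunk's bytemap fits in registers/L1, the packing of chunk k can overlap with the loads for chunk k+1, which is consistent with the IPC observation above.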
> [C++] Explore alternative strategy for Compare kernel implementation for
> better performance
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-9842
> URL: https://issues.apache.org/jira/browse/ARROW-9842
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 5.0.0
>
> Attachments: movemask-in-chunks.diff, movemask.patch
>
>
> The compiler may be able to vectorize comparison operations if the bitpacking
> of results is deferred until the end (or done in chunks). Instead, a temporary
> bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can
> be bitpacked into the output buffer. This may also reduce the code size of
> the compare kernels (which are currently quite large).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)