Wes McKinney created ARROW-9029: ----------------------------------- Summary: [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data Key: ARROW-9029 URL: https://issues.apache.org/jira/browse/ARROW-9029 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0
In analytics, it is common for data to be all not-null or mostly not-null. Data with > 50% nulls tends to be more exceptional. In this might, our {{BitmapReader}} class which allows iteration of each bit in a bitmap can be wasteful for mostly set validity bitmaps. I propose instead a new interface for use in kernel implementations, for lack of a better term {{BitmapScanner}}. This works as follows: * Uses popcount to accumulate consecutive 64-bit words from a bitmap where all values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can use benchmarks to determine what is a good limit). The length of this "all-on" run is returned to the caller in a single function call, so that this "run" of data can be processed without any bit-by-bit bitmap checking * If words containing unset bits is encountered, the scanner will similarly accumulate non-full words until the next full word is encountered or a limit is hit. The length of this "has nulls" run is returned to the caller, which then proceeds bit-by-bit to process the data For data with a lot of nulls, this may degrade performance somewhat but probably not that much empirically. However, data that is mostly-not-null should benefit from this. This BitmapScanner utility can probably also be used to accelerate the implementation of Filter for mostly-not-null data -- This message was sent by Atlassian Jira (v8.3.4#803005)