[ https://issues.apache.org/jira/browse/ARROW-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney resolved ARROW-9029. --------------------------------- Resolution: Fixed Issue resolved by pull request 7346 [https://github.com/apache/arrow/pull/7346] > [C++] Implement BitBlockCounter interface for blockwise popcounts of validity > bitmaps > ------------------------------------------------------------------------------------- > > Key: ARROW-9029 > URL: https://issues.apache.org/jira/browse/ARROW-9029 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > In analytics, it is common for data to be all not-null or mostly not-null. > Data with > 50% nulls tends to be more exceptional. In this light, our > {{BitmapReader}} class which allows iteration of each bit in a bitmap can be > computationally suboptimal for mostly set validity bitmaps. > I propose instead a new interface for use in kernel implementations, for lack > of a better term {{BitmapScanner}}. This works as follows: > * Uses hardware popcount to compute the number of set values in 256 bits at a > time (or whatever is the right window size). > * Code can use the returned "run" (length + # of set bits) to switch between > nullable/non-nullable code paths > For data with a lot of nulls, this may degrade performance somewhat but > probably not that much empirically. However, data that is mostly-not-null > should benefit from this. > This BitmapScanner utility can probably also be used to accelerate the > implementation of Filter for mostly-not-null data > I tried some other things that were slower (like trying to find the largest > consecutive run of all-set words) before doing the simple 256-bit popcount > solution. -- This message was sent by Atlassian Jira (v8.3.4#803005)