wesm opened a new pull request #7442:
URL: https://github.com/apache/arrow/pull/7442


   NOTE: the diff is larger than the functional changes alone would suggest 
because some code had to be rearranged to accommodate how data selection code 
is shared between the Take and Filter implementations.
   
   Summary:
   
   * Filter is now 1.5-6x faster across the board, most notably on primitive 
types with high-selectivity filters. The BitBlockCounters do a lot of the heavy 
lifting in that case, but even in the worst case, when the block counters 
never encounter a "full" block, this is still consistently faster.
   * Total -O3 code size for **both** Take and Filter is now about 600KB. 
That's down from about 8MB total prior to this patch and ARROW-5760.
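
A minimal sketch of the block-counting idea mentioned above, not the code in this patch: it walks a packed, LSB-first filter bitmap 64 bits at a time, copies a whole run of values when the block is "full", skips the block when it is empty, and falls back to a per-bit loop otherwise. The names `GetBit` and `FilterBlockwise` are hypothetical, and the sketch assumes no nulls and a bitmap of at least `ceil(length / 8)` bytes.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Arrow's API): read bit i of an LSB-first bitmap.
inline bool GetBit(const std::vector<uint8_t>& bitmap, int64_t i) {
  return (bitmap[i / 8] >> (i % 8)) & 1;
}

// Hypothetical sketch: keep values[i] wherever filter bit i is set.
std::vector<int32_t> FilterBlockwise(const std::vector<int32_t>& values,
                                     const std::vector<uint8_t>& filter_bitmap) {
  const int64_t length = static_cast<int64_t>(values.size());
  std::vector<int32_t> out;
  int64_t pos = 0;
  while (length - pos >= 64) {
    // Load 64 filter bits; the popcount does not depend on byte order.
    uint64_t word;
    std::memcpy(&word, filter_bitmap.data() + pos / 8, sizeof(word));
    const std::size_t popcount = std::bitset<64>(word).count();
    if (popcount == 64) {
      // "Full" block: bulk-copy all 64 values without testing any bit.
      out.insert(out.end(), values.begin() + pos, values.begin() + pos + 64);
    } else if (popcount > 0) {
      // Mixed block: test each bit individually.
      for (int64_t i = pos; i < pos + 64; ++i) {
        if (GetBit(filter_bitmap, i)) out.push_back(values[i]);
      }
    }
    // An empty block (popcount == 0) is skipped entirely.
    pos += 64;
  }
  // Tail of fewer than 64 values.
  for (; pos < length; ++pos) {
    if (GetBit(filter_bitmap, pos)) out.push_back(values[pos]);
  }
  return out;
}
```

Even when no block is ever full, the empty-block skip and the single popcount per 64 values keep the bookkeeping small, which is in line with the worst-case observation above.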
   
   Some incidental changes:
   * Implemented a fast conversion from boolean filter to take indices (aka 
"selection vector"), `compute::internal::GetTakeIndices`. I also changed record 
batch filtering to use this, which should be faster (it would be good to have 
some benchmarks to confirm this).
   * Various expansions to the BitBlockCounter classes that I needed to support 
this work.
   * Fixed a bug (ARROW-9142) in `RandomArrayGenerator::Boolean`. The 
probability parameter was being interpreted as the probability of a false value 
rather than the probability of a true value. IIUC, for a Bernoulli 
distribution the specified probability is P(X = 1), not P(X = 0). Could someone 
please confirm this?
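
To illustrate the filter-to-selection-vector conversion described in the list above, here is a rough standalone sketch. It is not the implementation of `compute::internal::GetTakeIndices`; the names `FilterToTakeIndices` and `TakeValues` are hypothetical, nulls are ignored, and the filter is assumed to be a packed, LSB-first bitmap.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: turn a packed boolean filter into take indices.
std::vector<int32_t> FilterToTakeIndices(const std::vector<uint8_t>& filter_bitmap,
                                         int64_t length) {
  std::vector<int32_t> indices;
  for (int64_t byte = 0; byte * 8 < length; ++byte) {
    const uint8_t bits = filter_bitmap[byte];
    if (bits == 0) continue;  // nothing selected in these 8 slots
    for (int64_t bit = 0; bit < 8; ++bit) {
      const int64_t i = byte * 8 + bit;
      if (i >= length) break;  // ignore padding bits past the end
      if ((bits >> bit) & 1) indices.push_back(static_cast<int32_t>(i));
    }
  }
  return indices;
}

// Hypothetical sketch: gather values at the given indices (no bounds checks).
template <typename T>
std::vector<T> TakeValues(const std::vector<T>& values,
                          const std::vector<int32_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (int32_t i : indices) out.push_back(values[i]);
  return out;
}
```

The likely appeal for record batches is that the bitmap is scanned once and the resulting indices are reused for every column, rather than re-reading the filter per column. Whether that is faster in practice is what the benchmarks requested above would need to show.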



