yordan-pavlov commented on pull request #8960:
URL: https://github.com/apache/arrow/pull/8960#issuecomment-748325991


   @jorgecarleitao these are great performance improvements when multiple 
arrays are filtered; this should perform very well when filtering a 
record batch containing many columns. I imagine this is explained by doing more 
work in advance, when building the filter, and less work when applying the 
filter to each array (compared to the previous implementation with the filter 
context). 
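
To make the "more work in advance, less work per array" idea concrete, here is a minimal sketch of how I picture it. It is not the code in this PR and not the arrow crate's API: `PreparedFilter`, `new` and `apply` are hypothetical names, and the mask is a plain `&[bool]` rather than a bit-packed `BooleanArray`. It also assumes every column is at least as long as the mask.

```rust
/// A filter precomputed from a boolean mask: half-open `(start, end)` ranges
/// of consecutive `true` values. Hypothetical type, not the arrow crate's API.
pub struct PreparedFilter {
    ranges: Vec<(usize, usize)>,
    selected: usize,
}

impl PreparedFilter {
    /// Scan the mask once and record every run of consecutive `true` values.
    pub fn new(mask: &[bool]) -> Self {
        let mut ranges = Vec::new();
        let mut selected = 0;
        let mut run_start: Option<usize> = None;
        for (i, &keep) in mask.iter().enumerate() {
            match (keep, run_start) {
                // A run of selected values begins here.
                (true, None) => run_start = Some(i),
                // The current run ends just before `i`.
                (false, Some(start)) => {
                    ranges.push((start, i));
                    selected += i - start;
                    run_start = None;
                }
                _ => {}
            }
        }
        // Close a run that extends to the end of the mask.
        if let Some(start) = run_start {
            ranges.push((start, mask.len()));
            selected += mask.len() - start;
        }
        PreparedFilter { ranges, selected }
    }

    /// Apply the precomputed ranges to one column; the mask is not scanned again.
    /// Assumes `column.len()` is at least the length of the original mask.
    pub fn apply<T: Copy>(&self, column: &[T]) -> Vec<T> {
        let mut out = Vec::with_capacity(self.selected);
        for &(start, end) in &self.ranges {
            out.extend_from_slice(&column[start..end]);
        }
        out
    }
}
```

With something like this, `PreparedFilter::new(&mask)` is paid once per record batch and `apply` is called once per column, so the scan cost is amortized over many columns but not over a single one. In this sketch the per-column cost depends on the number of ranges as well as on the number of values copied.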
   
   The performance degradation in the `filter u8` benchmark is interesting - do 
you have a hypothesis for what's causing it? I wonder if it could again be 
explained by the new implementation doing more work in advance, which works 
very well when filtering multiple columns but is a bit slower when filtering a 
single column.
   
   Also, I would expect the benchmarks with highly selective filters (mostly 0s 
in the filter array) to be faster, as there is more skipping and less copying, 
than the benchmarks with low-selectivity filters (mostly 1s in the filter 
array), which involve more copying and less skipping. However, this 
relationship appears to be reversed in the results above.
   
   I also wonder how repeatable the benchmarks are now that they use randomly 
generated arrays. What are your observations; are the benchmark results fairly 
stable across multiple runs?
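
One way to keep randomly generated inputs repeatable is to fix the seed, so every run filters exactly the same mask. The sketch below is only an illustration of that idea: `SplitMix64` and `seeded_mask` are hypothetical helpers written with the standard library alone, and I have not checked how the benchmarks in this PR actually generate their data. A real benchmark would more likely use a seedable generator from a dedicated crate.

```rust
/// SplitMix64, a small deterministic generator. It is used here only so the
/// sketch needs no external crates; it is not what the benchmarks use.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Advance the state and return the next 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Build a mask of `len` values in which each value is `true` with
/// probability roughly `true_fraction`. The same seed always yields the
/// same mask, so repeated benchmark runs see identical input.
pub fn seeded_mask(seed: u64, len: usize, true_fraction: f64) -> Vec<bool> {
    let mut rng = SplitMix64::new(seed);
    (0..len)
        .map(|_| {
            // Map the top 53 bits to a float in [0, 1).
            let unit = (rng.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
            unit < true_fraction
        })
        .collect()
}
```

For example, `seeded_mask(42, 65_536, 0.8)` could stand in for a mostly-1s filter and `seeded_mask(42, 65_536, 0.02)` for a mostly-0s one, with both staying identical from run to run.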
   
   I also like how the filter kernel is now implemented using the 
`BitChunkIterator`; overall great work!
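
For readers following along, this is roughly why I find chunked iteration attractive. The sketch assumes the mask is already packed into `u64` words, least significant bit first; `filter_by_words` is a hypothetical function and does not use the real `BitChunkIterator` API.

```rust
/// Filter `values` using a mask packed into 64-bit words (bit `i % 64` of
/// word `i / 64` selects `values[i]`). Hypothetical helper for illustration.
pub fn filter_by_words<T: Copy>(values: &[T], mask_words: &[u64]) -> Vec<T> {
    let mut out = Vec::new();
    for (w, &word) in mask_words.iter().enumerate() {
        let base = w * 64;
        if base >= values.len() {
            break;
        }
        let end = (base + 64).min(values.len());
        if word == 0 {
            // Nothing selected in this chunk: skip 64 values with one comparison.
            continue;
        }
        if word == u64::MAX {
            // Everything selected: copy the whole chunk at once.
            out.extend_from_slice(&values[base..end]);
            continue;
        }
        // Mixed chunk: visit only the set bits, lowest first.
        let mut bits = word;
        while bits != 0 {
            let i = base + bits.trailing_zeros() as usize;
            if i >= end {
                break;
            }
            out.push(values[i]);
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}
```

In this sketch, all-zero and all-one words take the cheap paths, and only mixed words need bit-by-bit work.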



