alamb commented on PR #18873:
URL: https://github.com/apache/datafusion/pull/18873#issuecomment-3606910410

   > You piqued my interest with why this is slow.
   > 
   > couple of questions:
   > 
   > 1. what are the sizes of the left and right boolean buffers? maybe they are very large and each copy is expensive
   > 2. who produces the selection masks? can they reuse a mutable boolean buffer?
   > 
   > ideas:
   > 
   > 1. you could try to reuse the same buffer when combining selection masks and thus avoid a copy every time
   > 2. keep track of some estimate of how many true values exist in the selection mask for each `and_then`
   >    for a large number of true values and a large right selection mask you should work in chunks rather than bits
   > 3. keep some kind of data structure that lets you track whether it is better to do
   
   Thanks @rluvaton 
   
   I think the sizes are typically the batch size (8192 rows)
   
   The masks come from https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html (which DataFusion provides)
   
   I think the reason it is currently slower is that the BooleanArrays are always converted back to RowSelections, specifically via https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.from_filters
   
   For patterns with many small selections, this is much worse and takes a lot 
of time
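   
   To make the cost concrete, here is a minimal sketch of turning a boolean mask into run-length select/skip entries. It is a simplified model using only the standard library, not the parquet crate's code: `Run` and `mask_to_runs` are hypothetical names, and a plain `&[bool]` stands in for a BooleanArray. The point is only that the number of entries grows with how fragmented the mask is.

```rust
/// Hypothetical stand-in for a run-length row selector
/// (an assumption for illustration, not the parquet crate's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Run {
    skip: bool,
    row_count: usize,
}

/// Collapse a boolean mask into alternating select/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if this row has the same state.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise the state flipped (or this is the first row): start a new run.
        runs.push(Run { skip, row_count: 1 });
    }
    runs
}

fn main() {
    // One contiguous block of selected rows in an 8192-row batch.
    let mut clustered = vec![false; 8192];
    for v in &mut clustered[1000..2000] {
        *v = true;
    }
    // Every other row selected: the worst case for a run-length encoding.
    let alternating: Vec<bool> = (0..8192).map(|i| i % 2 == 0).collect();

    println!("clustered:   {} runs", mask_to_runs(&clustered).len()); // 3 runs
    println!("alternating: {} runs", mask_to_runs(&alternating).len()); // 8192 runs
}
```

   In this model a clustered mask collapses to 3 entries, while an alternating mask needs one entry per row, so the converted form is much larger than the 1024 bytes of the packed bitmask it came from.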
   
   This is basically what I am working on avoiding in https://github.com/apache/arrow-rs/pull/8902
   
   The ideas are good. I will try to incorporate them, in particular keeping track of an estimate of how many true values exist in the selection mask for each `and_then`, and working in chunks rather than bits when both the number of true values and the right selection mask are large.
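   
   As a rough illustration of the "reuse the buffer", "track how many are true" and "chunks rather than bits" ideas, here is a minimal sketch for the simpler case of two equal-length masks packed into 64-bit words. Note the assumptions: this is a plain intersection, not the `and_then` semantics of RowSelection, and `and_in_place` is a hypothetical helper rather than an arrow-rs API.

```rust
/// Intersect `right` into `left` one 64-bit word at a time, reusing `left`'s
/// storage, and return the number of bits still set afterwards.
/// (Hypothetical helper for illustration; not part of arrow-rs.)
fn and_in_place(left: &mut [u64], right: &[u64]) -> usize {
    assert_eq!(left.len(), right.len(), "masks must cover the same rows");
    let mut set_bits = 0usize;
    for (l, r) in left.iter_mut().zip(right) {
        // 64 rows are combined per iteration; no new buffer is allocated.
        *l &= *r;
        // Keep a running count so callers can pick a strategy for the next step.
        set_bits += l.count_ones() as usize;
    }
    set_bits
}

fn main() {
    // 8192 rows packed into 128 words.
    let mut left = vec![u64::MAX; 128];
    let right = vec![0x5555_5555_5555_5555u64; 128];
    let remaining = and_in_place(&mut left, &right);
    println!("{remaining} of 8192 rows still selected"); // 4096 of 8192
}
```

   The returned count is the kind of estimate that could drive a choice between a mask and a run-length representation for the next predicate.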


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

