alamb commented on PR #18873: URL: https://github.com/apache/datafusion/pull/18873#issuecomment-3606910410
> You piqued my interest with why this is slow.
>
> couple of questions:
>
> 1. what are the sizes of the left and right boolean buffers? maybe they are very large and each copy is expensive
> 2. who produces the selection masks? can they reuse a mutable boolean buffer?
>
> ideas:
>
> 1. you could try to reuse the same buffer when combining selection masks and thus avoid a copy every time
> 2. keep track of some estimate of how many true exists in the selection mask for each `and_then`
>    for large number of true and large number right selection mask you should work in chunks rather than bits
> 3. keep some kind of data struct that let you track whether it is better to do

Thanks @rluvaton

I think the sizes are typically the batch size (8192 rows).

The masks come from https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html (which DataFusion provides).

I think the reason it is currently slower is that the BooleanArrays are always converted back to RowSelections, specifically in https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.from_filters

For patterns with many small selections, this conversion is much worse and takes a lot of time.

This is basically what I am working on avoiding in https://github.com/apache/arrow-rs/pull/8902

The ideas are good. I will try and incorporate them in

> keep track of some estimate of how many true exists in the selection mask for each and_then
> for large number of true and large number right selection mask you should work in chunks rather than bits
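
To make the "work in chunks rather than bits" idea concrete, here is a minimal sketch in plain Rust. It is not the arrow-rs or parquet implementation: the `Mask` type, its helper methods, and this free-standing `and_then` function are all hypothetical and only mirror the semantics discussed above (the right mask has one bit per selected row of the left mask). It assumes bits are stored LSB-first in `u64` words and that unused trailing bits are zero.

```rust
/// Hypothetical bit mask (not an arrow-rs type): bit `i` lives in
/// `words[i / 64]` at position `i % 64`; unused trailing bits are zero.
#[derive(Debug, Clone, PartialEq)]
struct Mask {
    words: Vec<u64>,
    len: usize,
}

impl Mask {
    fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Mask { words, len: bits.len() }
    }

    fn get(&self, i: usize) -> bool {
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    fn count_ones(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// Read 64 bits starting at bit `offset`; bits past the end read as 0.
    fn read_word(&self, offset: usize) -> u64 {
        let (w, s) = (offset / 64, offset % 64);
        let lo = self.words.get(w).copied().unwrap_or(0) >> s;
        if s == 0 {
            lo
        } else {
            lo | (self.words.get(w + 1).copied().unwrap_or(0) << (64 - s))
        }
    }
}

/// Combine two masks: `right` holds one bit per set bit of `left`.
/// The result has `left.len` bits; a bit is set only if it is set in
/// `left` and the corresponding bit of `right` is set.
fn and_then(left: &Mask, right: &Mask) -> Mask {
    assert_eq!(left.count_ones(), right.len);
    let mut out = vec![0u64; left.words.len()];
    let mut r = 0usize; // index of the next unread bit of `right`
    for (i, &lw) in left.words.iter().enumerate() {
        if lw == 0 {
            continue; // no rows selected in this word, nothing to consume
        } else if lw == u64::MAX {
            // Chunk path: all 64 rows selected, so copy 64 bits at once.
            out[i] = right.read_word(r);
            r += 64;
        } else {
            // Bit path: walk only the set bits of this word.
            let mut rest = lw;
            while rest != 0 {
                let bit = rest.trailing_zeros();
                if right.get(r) {
                    out[i] |= 1u64 << bit;
                }
                r += 1;
                rest &= rest - 1; // clear the lowest set bit
            }
        }
    }
    Mask { words: out, len: left.len }
}
```

When most words of the left mask are all ones or all zeros, this touches one word per 64 rows instead of one bit per row. Whether that pays off for 8192-row batches would need measuring, and an estimate of the number of set bits (as suggested in the quoted idea) could be used to pick between the two paths up front.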
