[GitHub] [arrow-datafusion] alamb commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

via GitHub Mon, 10 Apr 2023 10:21:15 -0700


alamb commented on issue #5944:
URL: 
https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502083042


   > One idea would be to create the filtered batches by copying the arrays and 
mutating the validity bitmap to hide the rows that are filtered out. This would 
potentially change the semantics in some cases though so we can probably only 
do this under certain conditions.
   
   I think this basic idea is called a "selection vector" in the literature -- 
and as you hint at, it is not quite the same as the null mask as it has 
different semantics.
   
   One approach might be to add another enum type to `ColumnarValue` that had 
an additional validity mask
   
   
https://github.com/apache/arrow-datafusion/blob/bbc71692fcd8dd9f3a9686162e59d092b37031f2/datafusion/expr/src/columnar_value.rs#L33
   
   
   After @tustvold 's recent work in Arrow, I think this would just be a 
https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html and should 
be straightforward to use.
   
   To really take advantage of a selection vector, however, the underlying 
compute kernels need to be updated to know how to ignore the selection vectors 
(and likely only do so when they are sparse)
   
   
   > Another idea is to update the aggregate logic to perform the predicate 
evaluation and then use the resulting bitmap to determine which rows to 
accumulate.
   
   While not exactly the same,  @yjshen 's has been workking to add filtering 
to the aggregate input here, which is similar:  
https://github.com/apache/arrow-datafusion/pull/5868
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Reply via email to