[GitHub] [arrow-datafusion] andygrove opened a new issue, #5944: Fuse grouped aggregate and filter operators for improved performance

via GitHub Mon, 10 Apr 2023 07:16:33 -0700


andygrove opened a new issue, #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944


   ### Is your feature request related to a problem or challenge?
   
   When we perform a grouped aggregate on a filtered input (such as with TPC-H 
q1), the filter operator performs two main tasks:
   
   - Evaluate the filter predicate (usually very fast)
   - Create new batches and copy over the filtered data (very slow if the 
filter is not very selective, as in q1)
   
   I wonder if we would see a significant performance improvement if we could 
avoid creating the filtered batches in this case.
   
   One idea would be to create the filtered batches by copying the arrays and 
mutating the validity bitmap to hide the rows that are filtered out. This would 
potentially change the semantics in some cases though so we can probably only 
do this under certain conditions.
   
   Another idea is to update the aggregate logic to perform the predicate 
evaluation and then use the resulting bitmap to determine which rows to 
accumulate.
   
   
   
   ### Describe the solution you'd like
   
   I am working on a small prototype of this, outside of DataFusion, that I 
will share once the code is less embarrassing.
   
   ### Describe alternatives you've considered
   
   It would be worth seeing how other engines handle this.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] andygrove opened a new issue, #5944: Fuse grouped aggregate and filter operators for improved performance

Reply via email to