alamb opened a new issue, #6798:
URL: https://github.com/apache/arrow-datafusion/issues/6798

   ### Is your feature request related to a problem or challenge?
   
   We are trying to make hash-based aggregation significantly faster; see https://github.com/apache/arrow-datafusion/issues/4973
   
   This will require some nontrivial changes to how hash aggregation is organized. At the moment `BoundedAggregateStream` and `GroupedHashAggregateStream` share a significant amount of code, so either we will have to duplicate the work to make hash aggregation faster, or `BoundedAggregateStream` will not get the benefits.
   
   Here is a visual depiction of the common code:
   
   ```shell
   meld \
     datafusion/core/src/physical_plan/aggregates/bounded_aggregate_stream.rs \
     datafusion/core/src/physical_plan/aggregates/row_hash.rs
   ```
   
   ![Screenshot 2023-06-29 at 9 16 18 AM](https://github.com/apache/arrow-datafusion/assets/490673/72d4f761-02df-44db-a2d4-2c7d1d125770)
   
   
   ### Describe the solution you'd like
   
   Reduce duplication between `BoundedAggregateStream` and `GroupedHashAggregateStream`.
   
   The major differences are:
   1. Choice of when output can be emitted
   2. Clearing previous group state when groups have been emitted
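
As a rough illustration of one way the shared code could be factored, here is a minimal sketch in which the two differences above sit behind a small trait and everything else is common. All names (`EmitPolicy`, `EmitAtEnd`, `EmitWhenOrdered`, `SharedAggregator`) are hypothetical and are not existing DataFusion APIs. A toy `SUM` over `i64` keys stands in for the real row-based group state and accumulators.

```rust
use std::collections::HashMap;

/// Hypothetical hook for difference 1: deciding when output can be emitted.
trait EmitPolicy {
    /// Returns the group keys that are complete and may be emitted now.
    /// `last_key` is the most recent key seen in the input so far.
    fn emittable(&self, keys: &[i64], last_key: Option<i64>, input_done: bool) -> Vec<i64>;
}

/// Unordered input: no group is known to be complete until the input ends.
struct EmitAtEnd;

impl EmitPolicy for EmitAtEnd {
    fn emittable(&self, keys: &[i64], _last_key: Option<i64>, input_done: bool) -> Vec<i64> {
        if input_done { keys.to_vec() } else { Vec::new() }
    }
}

/// Input sorted on the group key: every group except the one that is
/// still being filled is complete.
struct EmitWhenOrdered;

impl EmitPolicy for EmitWhenOrdered {
    fn emittable(&self, keys: &[i64], last_key: Option<i64>, input_done: bool) -> Vec<i64> {
        keys.iter()
            .copied()
            .filter(|k| input_done || Some(*k) != last_key)
            .collect()
    }
}

/// Shared aggregation core (assumed here to be a per-group SUM),
/// generic over the emission policy.
struct SharedAggregator<P: EmitPolicy> {
    policy: P,
    sums: HashMap<i64, i64>,
    last_key: Option<i64>,
}

impl<P: EmitPolicy> SharedAggregator<P> {
    fn new(policy: P) -> Self {
        Self { policy, sums: HashMap::new(), last_key: None }
    }

    /// Accumulates one batch of (key, value) rows, then emits whatever
    /// the policy allows, sorted by key.
    fn push_batch(&mut self, batch: &[(i64, i64)], input_done: bool) -> Vec<(i64, i64)> {
        for &(key, value) in batch {
            *self.sums.entry(key).or_insert(0) += value;
            self.last_key = Some(key);
        }
        let keys: Vec<i64> = self.sums.keys().copied().collect();
        let ready = self.policy.emittable(&keys, self.last_key, input_done);

        let mut out = Vec::new();
        for key in ready {
            // Difference 2: clear the state of groups that have been emitted.
            if let Some(sum) = self.sums.remove(&key) {
                out.push((key, sum));
            }
        }
        out.sort();
        out
    }
}
```

With this shape, a faster hashing path would only need to be written once in the shared core, and the bounded variant would differ only in the policy it plugs in.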
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


