[GitHub] [arrow-datafusion] yjshen commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregaton

via GitHub Mon, 06 Mar 2023 07:15:10 -0800


yjshen commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1456323067


   While I agree that we may need to revisit the current `Box<dyn Accumulator>` 
approach when creating a fallback or default strategy, I do not believe we 
should combine all hash aggregate methods into one universal approach. Given 
that we are constructing an essential operator for a database execution engine, 
it is important to consider specialized implementations that offer performance 
benefits for handling different use cases.
   
   I view row hash as a fast lane for certain queries rather than a dead end. 
By efficiently executing queries through row hash, we can avoid slower or more 
general approaches. Furthermore, I am open to exploring other specialized 
aggregate implementations that offer performance benefits to handle more use 
cases quickly.
   
   For example, Spark provides multiple hash aggregate implementations, 
including vectorized hash, hash, and object hash. Similarly, ClickHouse offers 
different implementation options for its hash table, based on key types. 
Utilizing these specialized implementations can optimize query performance 
without relying on a one-size-fits-all approach.
   
   
   > It should be easy to add types and aggregations
   
   Regarding the proposed criteria for a fallback strategy, I believe that 
while it should be easy to add types and aggregations, not all internal 
implementations should be treated equally, or else we risk the entire engine 
running at the speed of UDAFs.
   
   > It should work w/ memory accounting
   
   While it would be beneficial for the fallback strategy to work with memory 
accounting, it is not strictly necessary. Users can take risks when defining 
UDAFs, or we could even risk the less-used built-in functions.
   
   > Data accesses should be properly typed (in contrast to the aligned row 
format)
   
   I'm not convinced that data accesses need to be typed for aggregate 
implementations, which is entirely internal to database engines.
   
   > The solution should consider variable-length types. These cannot be added 
as an afterthought.
   
   I understand that you were referring to the aligned format, but it is 
possible to handle variable length fields using 'pointer swizzling.' 
Additionally, I don't believe that `median` is a performance-critical aggregate 
function that needs to be considered a first-class citizen for all aggregate 
strategies.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] yjshen commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregaton

Reply via email to