[GitHub] [arrow-datafusion] yjshen commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

via GitHub Mon, 06 Mar 2023 19:38:08 -0800


yjshen commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1457463675


   > I think it is important to highlight that the proposal in 
https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1324876060 
is still a row hash approach, just using a different row encoding. Early 
benchmarks show it to be significantly faster.
   
   @tustvold There appears to be a gap between your proposal and my current 
understanding of it. Can you clarify how I should interpret this?
   
   > Hence the aggregator will be dyn-dispatched ONCE per record batch and will 
keep its own internal state. This moves the key->state map from the 
[row_]hash.rs to the aggregators.
   
   Does this need a hash table in each aggregator? What does the internal 
state mean? Can you elaborate?
   
   If yes, what do you mean by "still a row hash approach"? Does it mean the 
keys are rows, but the aggregator states are not?
   If not, why did @crepererum say "We're trading speed for slightly more 
memory usage here....", and why mention "in the per-aggregator hash tables by 
either interning...."?
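
For concreteness, one possible reading of the quoted proposal is sketched below: each aggregator is called through dynamic dispatch once per batch and owns its own key -> state map, with group keys encoded as byte rows. This is only an illustration under stated assumptions. `GroupedAggregator`, `SumAggregator`, `encode_key`, and `aggregate` are hypothetical names, not DataFusion or arrow-rs APIs, and the fixed-width key encoding is a toy stand-in for a real row format.

```rust
use std::collections::HashMap;

/// Hypothetical trait: one dynamically dispatched call per batch, not per row.
trait GroupedAggregator {
    /// `keys[i]` is the encoded group key (a "row") for `values[i]`.
    fn update_batch(&mut self, keys: &[Vec<u8>], values: &[i64]);
    /// Final (key, result) pairs, sorted by key for deterministic output.
    fn finish(&self) -> Vec<(Vec<u8>, i64)>;
}

/// SUM aggregator that keeps its own key -> state map (an assumption about
/// what "internal state" means in the quoted proposal).
#[derive(Default)]
struct SumAggregator {
    state: HashMap<Vec<u8>, i64>,
}

impl GroupedAggregator for SumAggregator {
    fn update_batch(&mut self, keys: &[Vec<u8>], values: &[i64]) {
        // Tight, statically dispatched loop over the rows of one batch.
        for (key, value) in keys.iter().zip(values) {
            // Cloning the key on every row is wasteful; it keeps the sketch short.
            *self.state.entry(key.clone()).or_insert(0) += *value;
        }
    }

    fn finish(&self) -> Vec<(Vec<u8>, i64)> {
        let mut out: Vec<_> = self.state.iter().map(|(k, v)| (k.clone(), *v)).collect();
        out.sort();
        out
    }
}

/// Toy row encoding: two group columns as fixed-width big-endian bytes.
fn encode_key(a: u32, b: u32) -> Vec<u8> {
    let mut row = Vec::with_capacity(8);
    row.extend_from_slice(&a.to_be_bytes());
    row.extend_from_slice(&b.to_be_bytes());
    row
}

/// Feeds every batch to the aggregator with one `dyn` call per batch.
fn aggregate(batches: &[(Vec<Vec<u8>>, Vec<i64>)]) -> Vec<(Vec<u8>, i64)> {
    let mut agg: Box<dyn GroupedAggregator> = Box::new(SumAggregator::default());
    for (keys, values) in batches {
        agg.update_batch(keys, values);
    }
    agg.finish()
}
```

Under this reading, a query with several aggregate expressions would hold one such map per aggregator, so every group key would be stored once per aggregator unless the keys are shared or interned.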
   
   If the proposal is to remove all (word-aligned/compact) row formats inside 
DataFusion, use the row format from arrow-rs instead, and achieve a significant 
performance improvement, I fully support the change.

