yjshen commented on issue #4973: URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1457463675
> I think it is important to highlight that the proposal in https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1324876060 is still a row hash approach, just using a different row encoding. Early benchmarks show it to be significantly faster.

@tustvold There appears to be a discrepancy between your proposal and my current understanding. Can you clarify how I should interpret this?

> Hence the aggregator will be dyn-dispatched ONCE per record batch and will keep its own internal state. This moves the key->state map from the [row_]hash.rs to the aggregators.

Do you know if it needs a hash table in each aggregator? What does "internal state" mean here? Can you elaborate?

If yes, what do you mean by "still a row hash approach"? Keys are rows, but aggregator states are not?

If not, why did @crepererum say "We're trading speed for slightly more memory usage here....", and why "in the per-aggregator hash tables by either interning...."?

If the proposal is to remove all (word-aligned/compact) row formats inside DataFusion, use the row format from arrow-rs, and achieve a significant performance improvement, I fully vote for the change.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
