2010YOUY01 commented on PR #23181: URL: https://github.com/apache/datafusion/pull/23181#issuecomment-4852351969
Thank you for the review, @alamb. The suggestions make sense to me, and I’ll try to address them later (likely tomorrow). > Longer term it would be nice to have some way to reduce the duplication (I realize some in inevitable, but there is a lot I think that is unecessary) I’m also quite tempted to reuse code here. For example, the partial-reduce variant could probably be implemented by adding a few flags to the existing partial aggregation path. - https://github.com/apache/datafusion/pull/23233 My main concern is that these paths may evolve into very different shapes in the future, due to optimizations specialized for each case. I want to keep that future evolution easy. One example is the ordered aggregation variant. For the fully ordered case, I realized we may not need a hash table at all. We could likely make it faster by finding the group split points first and then aggregating each group directly. So as a middle ground, I think we should: * Keep the execution paths (streams) separate when their semantics are different. * Extract common utility functions for code that is almost identical. If the paths need to evolve differently later, we can then stop sharing those utilities. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
