rluvaton commented on PR #15022: URL: https://github.com/apache/datafusion/pull/15022#issuecomment-2819269461
> The "ideal" flow in DF is to check for ordering during planning and choose specialized executors (and accumulators) based on this information. We don't do this in all cases yet, but that's how the code has been evolving. That is not always the case, Comet for example build `PhysicalPlan` directly and does not use the optimizer. > `GroupedHashAggregateStream` stores ordering-related information in the `group_ordering` attribute (of type `GroupOrdering`) and does some emission-related optimization with that. However, IIRC group accumulators do not assume any ordering information (and they are not optimized for cases when there is ordering). you are right, Group accumulator does not assume any ordering information. > If there are certain group accumulators that can benefit from ordering information, why don't we define a new class of group accumulators for those and instantiate them instead of the more general, non-ordering-assuming ones? because sometimes we need to transition from general to specific, like in the case of spilling. and if we initiate new group accumulator by using the `AggregateUDF` the implementation will be weird and any internal statistics that group accumulator might have saved will be lost. > In any case, we really should have a specific example (e.g. a query or a plan) as we collaborate to find the right solution to this need. It will help us avoid losing time to misunderstandings. Can you share a minimal benefiting example? Thanks. My specific example is GroupAccumulator implementation of `array_agg` with distinct when there is a lot of data that requires spill, the implementation can be optimized for merge phase as we can avoid saving HashSet for each group and only save for the last group. A specific optimization already implemented in this PR for `GroupsAccumulatorAdapter` that show when it can benefit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org