rluvaton commented on PR #15022:
URL: https://github.com/apache/datafusion/pull/15022#issuecomment-2819269461

   > The "ideal" flow in DF is to check for ordering during planning and choose 
specialized executors (and accumulators) based on this information. We don't do 
this in all cases yet, but that's how the code has been evolving.
   
   That is not always the case, Comet for example build `PhysicalPlan` directly 
and does not use the optimizer.
   
   > `GroupedHashAggregateStream` stores ordering-related information in the 
`group_ordering` attribute (of type `GroupOrdering`) and does some 
emission-related optimization with that. However, IIRC group accumulators do 
not assume any ordering information (and they are not optimized for cases when 
there is ordering).
   
   you are right, Group accumulator does not assume any ordering information.
   
   > If there are certain group accumulators that can benefit from ordering 
information, why don't we define a new class of group accumulators for those 
and instantiate them instead of the more general, non-ordering-assuming ones?
   
   because sometimes we need to transition from general to specific, like in 
the case of spilling. and if we initiate new group accumulator by using the 
`AggregateUDF` the implementation will be weird and any internal statistics 
that group accumulator might have saved will be lost.
   
   > In any case, we really should have a specific example (e.g. a query or a 
plan) as we collaborate to find the right solution to this need. It will help 
us avoid losing time to misunderstandings. Can you share a minimal benefiting 
example? Thanks.
   
   My specific example is GroupAccumulator implementation of `array_agg` with 
distinct when there is a lot of data that requires spill, the implementation 
can be optimized for merge phase as we can avoid saving HashSet for each group 
and only save for the last group.
   
   A specific optimization already implemented in this PR for 
`GroupsAccumulatorAdapter` that show when it can benefit
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to