2010YOUY01 commented on issue #23178:
URL: https://github.com/apache/datafusion/issues/23178#issuecomment-4796770732

   Thank you for the great explanation! this makes sense to me.
   
   > Despite that, `GroupValuesColumn::emit(EmitTo::First(n))` still maintains 
lookup state for future interning. It calls `HashTable::retain` and rewrites 
remaining group indexes on every output batch.
   >...
   
   > Add an explicit terminal-output transition, for example:
   > ```
   > GroupValues::release_interning_state()
   > ```
   > called from `AggregateHashTable::start_outputting()`.
   
   We're about to re-implement all `GroupValues` for 
https://github.com/apache/datafusion/issues/7065, and this would also require 
some API changes to `GroupValues`. So I suggest we delay fixing the root issue 
until then, and avoid changing `GroupValues` for now.
   
   We will eventually move the aggregation code to the new implementation for:
   
   - https://github.com/apache/datafusion/issues/22710
   
   So I suggest we keep evolving the new implementation. For now, I think the 
best solution is the hack fix you proposed:
   
   > A short-term fallback is to materialize all output once and slice it when 
the estimated output size is safe, but that has higher peak-memory risk.
   
   I'm happy to implement it, but if you're interested in implementing it, I'd 
also be happy to review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to