2010YOUY01 commented on issue #23178: URL: https://github.com/apache/datafusion/issues/23178#issuecomment-4796770732
Thank you for the great explanation! this makes sense to me. > Despite that, `GroupValuesColumn::emit(EmitTo::First(n))` still maintains lookup state for future interning. It calls `HashTable::retain` and rewrites remaining group indexes on every output batch. >... > Add an explicit terminal-output transition, for example: > ``` > GroupValues::release_interning_state() > ``` > called from `AggregateHashTable::start_outputting()`. We're about to re-implement all `GroupValues` for https://github.com/apache/datafusion/issues/7065, and this would also require some API changes to `GroupValues`. So I suggest we delay fixing the root issue until then, and avoid changing `GroupValues` for now. We will eventually move the aggregation code to the new implementation for: - https://github.com/apache/datafusion/issues/22710 So I suggest we keep evolving the new implementation. For now, I think the best solution is the hack fix you proposed: > A short-term fallback is to materialize all output once and slice it when the estimated output size is safe, but that has higher peak-memory risk. I'm happy to implement it, but if you're interested in implementing it, I'd also be happy to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
