2010YOUY01 commented on PR #22862: URL: https://github.com/apache/datafusion/pull/22862#issuecomment-4668255459
> 1. I don't like this approach because it's a workaround for the hash aggregate issue, which is going to be reworked; additionally, it's not accurate memory accounting: when the first batch arrives, it reserves the memory for all the subsequent batches; this reservation could fail, meaning that if the operator supports spilling, it would spill on every batch (because each batch carries the entire memory reservation with it) > 2. Since GroupedHashAggregateStream could have multiple downstream operators, is the plan to change the memory accouting for those as well? I agree that fixing the root issue in hash aggregate is the long-term solution. The idea that operators should avoid returning small sliced batches backed by large buffers sounds like a reasonable property to follow. `AggregateExec` is the only violation I'm aware of today, though I'm not sure such a property can always be guaranteed given the diverse requirements of downstream projects. So I think it's a good idea to be defensive in memory-intensive operators, and this PR makes the situation better (arguably in an entropy-reducing way). It also makes sense to apply the same approach to other memory-intensive operators. I'm wondering whether you have a better alternative in mind, or if there are other concerns that would suggest we should not merge this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
