2010YOUY01 commented on PR #22862:
URL: https://github.com/apache/datafusion/pull/22862#issuecomment-4668255459

   > 1. I don't like this approach because it's a workaround for the hash 
aggregate issue, which is going to be reworked; additionally, it's not accurate 
memory accounting: when the first batch arrives, it reserves the memory for all 
the subsequent batches; this reservation could fail, meaning that if the 
operator supports spilling, it would spill on every batch (because each batch 
carries the entire memory reservation with it)
   > 2. Since GroupedHashAggregateStream could have multiple downstream 
operators, is the plan to change the memory accouting for those as well?
   
   I agree that fixing the root issue in hash aggregate is the long-term 
solution.
   
   The idea that operators should avoid returning small sliced batches backed 
by large buffers sounds like a reasonable property to follow. `AggregateExec` 
is the only violation I'm aware of today, though I'm not sure such a property 
can always be guaranteed given the diverse requirements of downstream projects.
   
   So I think it's a good idea to be defensive in memory-intensive operators, 
and this PR makes the situation better (arguably in an entropy-reducing way). 
It also makes sense to apply the same approach to other memory-intensive 
operators.
   
   I'm wondering whether you have a better alternative in mind, or if there are 
other concerns that would suggest we should not merge this?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to