Samyak2 commented on PR #22862: URL: https://github.com/apache/datafusion/pull/22862#issuecomment-4668820557
> 1. I don't like this approach because it's a workaround for the hash aggregate issue, which is going to be reworked; additionally, it's not accurate memory accounting: when the first batch arrives, it reserves the memory for all the subsequent batches; this reservation could fail, meaning that if the operator supports spilling, it would spill on every batch (because each batch carries the entire memory reservation with it) Continuing from https://github.com/apache/datafusion/issues/22526#issuecomment-4668727017, this happens with the current memory tracking too. I see this PR as a strict improvement over the current `get_record_batch_memory_size` usage. > 2. Since GroupedHashAggregateStream could have multiple downstream operators, is the plan to change the memory accouting for those as well? I would imagine so. As an example, the same problem currently exists with Repartition over Aggregate. The same fix would be needed in any operator that stores a sequence of `RecordBatch`es. For Repartition specifically, we would need to extend this API to be able to remove a record batch too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
