[GitHub] [arrow-datafusion] milenkovicm opened a new issue, #3460: Row Hash loads whole aggregation state to memory before sending

GitBox Tue, 13 Sep 2022 03:29:40 -0700


milenkovicm opened a new issue, #3460:
URL: https://github.com/apache/arrow-datafusion/issues/3460


   **Describe the bug**
   
   Row Hash aggregation loads the whole aggregation state into memory before 
sending a single batch downstream. The resulting record batch can have more 
rows than the predefined batch size.
   
   The problematic part of the code is 
https://github.com/milenkovicm/arrow-datafusion/blob/17f069df4227b837cf2741a545c39a8b68d5fd76/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L438
   where an iterator without limits is created and the whole state is cloned, 
which doubles the memory needed for the aggregation state.
   
   The function `poll_next` creates a single batch: 
https://github.com/milenkovicm/arrow-datafusion/blob/17f069df4227b837cf2741a545c39a8b68d5fd76/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L146
   
   **To Reproduce**
   
   Run an aggregation that produces more groups than the predefined batch 
size.
   
   **Expected behavior**
   
   The resulting aggregation output should be chunked according to the 
predefined batch size.
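
   A minimal sketch of the expected behavior, using only the Rust standard 
library. `GroupState` and `ChunkedOutput` are hypothetical names invented for 
this illustration and are not DataFusion types; the real operator would emit 
`RecordBatch`es from `poll_next` rather than `Vec`s. The idea is that the 
output side takes ownership of the final state and drains it at most 
`batch_size` rows at a time, so nothing is cloned and no emitted batch exceeds 
the limit.

```rust
/// Hypothetical stand-in for one group's accumulated state
/// (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct GroupState {
    key: u64,
    sum: i64,
}

/// Hypothetical output side of an aggregation: owns the final state and
/// hands it out in chunks of at most `batch_size` rows.
struct ChunkedOutput {
    // Owning iterator over the state; rows are moved out, never cloned.
    groups: std::vec::IntoIter<GroupState>,
    batch_size: usize,
}

impl ChunkedOutput {
    fn new(groups: Vec<GroupState>, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self {
            groups: groups.into_iter(),
            batch_size,
        }
    }
}

impl Iterator for ChunkedOutput {
    type Item = Vec<GroupState>;

    /// Yields the next chunk, or `None` once the state is fully drained.
    fn next(&mut self) -> Option<Self::Item> {
        let chunk: Vec<GroupState> = self
            .groups
            .by_ref()
            .take(self.batch_size)
            .collect();
        if chunk.is_empty() {
            None
        } else {
            Some(chunk)
        }
    }
}
```

   With 10 groups and a batch size of 4, iterating over `ChunkedOutput` yields 
chunks of 4, 4 and 2 rows instead of a single 10-row batch.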
   
   **Additional context**
   
   


