[GitHub] [arrow-datafusion] Dandandan opened a new issue #338: Speed up `finalize_aggregation` and `create_batch_from_map`

GitBox Fri, 14 May 2021 01:18:50 -0700


Dandandan opened a new issue #338:
URL: https://github.com/apache/arrow-datafusion/issues/338



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently `.to_array()` is called on each scalar value, which is slow and 
generates a lot of allocations.
   This causes three problems:
   * There is overhead for generating arrays in this way.
   * The single-row arrays are concatenated afterwards at the end, which is 
slow and would be unnecessary if the arrays were generated at once.
   * Intermediate `Vec`s are generated, causing more memory usage / allocations 
/ fragmentation.
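
To make the cost concrete, here is a minimal std-only sketch of the two shapes. `Scalar`, its `to_array()`, and both functions are hypothetical stand-ins (a `Vec<i64>` plays the role of an Arrow array); this is not DataFusion's actual code.

```rust
// Hypothetical stand-ins: `Scalar` plays the role of a scalar value and
// `Vec<i64>` plays the role of an Arrow array. Neither is DataFusion's real type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Scalar(pub i64);

impl Scalar {
    /// Mimics a per-scalar `to_array()`: one heap allocation per value.
    pub fn to_array(self) -> Vec<i64> {
        vec![self.0]
    }
}

/// Slow shape: build one single-row array per scalar, then concatenate.
/// Allocates an intermediate `Vec` of arrays plus one array per value.
pub fn column_via_single_row_arrays(scalars: &[Scalar]) -> Vec<i64> {
    let single_rows: Vec<Vec<i64>> = scalars.iter().map(|s| s.to_array()).collect();
    single_rows.concat()
}

/// Faster shape: reserve once and append every value into one output column.
pub fn column_in_one_pass(scalars: &[Scalar]) -> Vec<i64> {
    let mut column = Vec::with_capacity(scalars.len());
    for s in scalars {
        column.push(s.0);
    }
    column
}
```

Both functions return the same column; the second does it with a single allocation and no concatenation step.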
   
   I expect that fixing this would speed up some db-benchmark queries (group by 
queries with smaller groups) considerably and may decrease memory usage by quite a bit.
   
   **Describe the solution you'd like**
   Iterate over the values and emit arrays of `batch_size` elements at once.
   Or, as a first step, build a single batch from all of the values (as is the 
case currently) and emit smaller batches in a later PR.
   
   Emitting `batch_size` elements at a time would require keeping some state 
between batches and/or removing the emitted groups from the map.
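
One possible shape for the `batch_size` variant, sketched with std only. `GroupedOutput` and `next_batch` are hypothetical names, the value type is simplified to `i64`, and a `BTreeMap` is assumed here only so the output order is deterministic; the real operator may use a different map and real Arrow builders.

```rust
use std::collections::BTreeMap;

/// Hypothetical state kept between batches: group key -> finalized value.
pub struct GroupedOutput {
    groups: BTreeMap<String, i64>,
    batch_size: usize,
}

impl GroupedOutput {
    pub fn new(groups: BTreeMap<String, i64>, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { groups, batch_size }
    }

    /// Removes up to `batch_size` groups from the map and returns them as
    /// two columns (keys, values). Returns `None` once the map is empty.
    pub fn next_batch(&mut self) -> Option<(Vec<String>, Vec<i64>)> {
        if self.groups.is_empty() {
            return None;
        }
        let n = self.batch_size.min(self.groups.len());
        // Reserve each output column once instead of building per-row arrays.
        let mut keys = Vec::with_capacity(n);
        let mut values = Vec::with_capacity(n);
        for _ in 0..n {
            // `pop_first` removes the smallest remaining key, so emitted
            // groups no longer occupy memory in the map.
            let (k, v) = self.groups.pop_first().expect("map has at least n entries");
            keys.push(k);
            values.push(v);
        }
        Some((keys, values))
    }
}
```

Calling `next_batch` repeatedly until it returns `None` yields every group exactly once, in chunks of at most `batch_size` rows, and the map shrinks as batches are emitted.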
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   n/a


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
