[GitHub] [arrow-datafusion] alamb commented on issue #6937: Improve Memory usage with large numbers of groups

via GitHub Wed, 16 Aug 2023 14:53:26 -0700


alamb commented on issue #6937:
URL: 
https://github.com/apache/arrow-datafusion/issues/6937#issuecomment-1681317933


   > Doing that reduces the memory usage, but often with higher cost, which can 
be seen in the benchmark:
   
   Maybe we can get the performance back, for example by making the output 
creation faster 🤔
   
   Alternatively, we could consider making a single group operator that does 
the two-phase grouping within itself
   
   So instead of
   ```
   group by (final)
     repartition
       group by (initial)
   ```
   
   We would have
   
   ```
   group by
   ```
   
   and do the repartitioning within the operator itself. That way, if the 
first phase isn't helping, the operator can switch to the second phase.
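
The idea above can be sketched roughly as follows. This is a minimal, standalone illustration and not DataFusion code: `AdaptiveGroupBy`, `max_ratio`, `probe_rows`, and the per-batch pre-aggregation are all assumptions made for the example, and the aggregate is fixed to `COUNT(*)` over `u64` keys. The operator pre-aggregates each batch (first phase), hash-partitions the results into its own final tables (second phase), and stops pre-aggregating once the observed groups-per-row ratio suggests the first phase is not reducing the data.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical single operator computing COUNT(*) GROUP BY key.
/// None of these names are DataFusion APIs.
struct AdaptiveGroupBy {
    /// Second-phase state: one hash table per internal partition.
    finals: Vec<HashMap<u64, u64>>,
    /// Stop pre-aggregating once groups / rows exceeds this ratio (assumed knob).
    max_ratio: f64,
    /// Rows to observe before judging the ratio (assumed knob).
    probe_rows: usize,
    rows_seen: usize,
    groups_seen: usize,
    skip_partial: bool,
}

impl AdaptiveGroupBy {
    fn new(partitions: usize, max_ratio: f64, probe_rows: usize) -> Self {
        assert!(partitions > 0, "need at least one partition");
        Self {
            finals: (0..partitions).map(|_| HashMap::new()).collect(),
            max_ratio,
            probe_rows,
            rows_seen: 0,
            groups_seen: 0,
            skip_partial: false,
        }
    }

    /// The "repartition" step: pick the internal partition for a key by hash.
    fn partition_of(&self, key: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.finals.len()
    }

    /// Second phase: merge a (key, partial count) pair into its partition.
    fn merge(&mut self, key: u64, count: u64) {
        let p = self.partition_of(key);
        *self.finals[p].entry(key).or_insert(0) += count;
    }

    fn push_batch(&mut self, batch: &[u64]) {
        if self.skip_partial {
            // First phase was not helping: send rows straight to the second phase.
            for &key in batch {
                self.merge(key, 1);
            }
            return;
        }
        // First phase: pre-aggregate this batch locally.
        let mut partial: HashMap<u64, u64> = HashMap::new();
        for &key in batch {
            *partial.entry(key).or_insert(0) += 1;
        }
        self.rows_seen += batch.len();
        self.groups_seen += partial.len();
        for (key, count) in partial {
            self.merge(key, count);
        }
        // If nearly every row produced its own group, pre-aggregation is wasted work.
        if self.rows_seen >= self.probe_rows
            && (self.groups_seen as f64) / (self.rows_seen as f64) > self.max_ratio
        {
            self.skip_partial = true;
        }
    }

    /// Concatenate the partitions; keys are disjoint across them by construction.
    fn finish(self) -> HashMap<u64, u64> {
        self.finals.into_iter().flatten().collect()
    }
}
```

A caller would construct it with something like `AdaptiveGroupBy::new(4, 0.5, 4)`, feed batches through `push_batch`, and read the counts from `finish()`. A real operator would also have to handle memory accounting, arbitrary aggregate state, and parallel execution, none of which is modeled here.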
   
   However, this might impact downstream projects like Ballista that want to 
distribute the first phase 🤔


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
