[GitHub] [arrow-datafusion] alamb opened a new pull request #808: (WIP) Rework GroupByHash for faster performance and support grouping by nulls

GitBox Sun, 01 Aug 2021 05:56:29 -0700


alamb opened a new pull request #808:
URL: https://github.com/apache/arrow-datafusion/pull/808



   NOTE: this PR is WIP -- still todo:
   - [ ] Debugging
   - [ ] Measure performance
   - [ ] Handle hash collisions
   
   # Which issue does this PR close? 
   
   Closes https://github.com/apache/arrow-datafusion/issues/790 by implementing 
a new design for group by hash
   
   # Note
   Built on https://github.com/apache/arrow-datafusion/pull/793
   
   
   
   
   # Rationale for this change
   1. Regain performance lost when we added support for GROUP BY NULL; see 
https://github.com/apache/arrow-datafusion/issues/790 for more details
   
   # What changes are included in this PR?
   1. Use a hash to create the appropriate grouping, and use indexes into 
`group_values` rather than looking up hash keys many times
   
   # Performance
   
   ## Measurements
   In progress
   
   ## Notes
   This approach should be faster because it:
   1. Avoids copying GroupValues into a Vec to hash, saving both time and space
   2. Avoids several hash table lookups (uses indexes into `group_values` 
instead)
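
A minimal sketch of the idea in the notes above, not the code in this PR: hash each row once, keep the group keys in a `Vec`, and have the hash table store only indexes into that `Vec`. All names here (`Accumulators`, `GroupState`, `insert_row`, `hash_row`) are hypothetical, keys are simplified to `Option<i64>` columns with `None` standing in for SQL NULL, and hash collisions are resolved by comparing the stored key, which is one possible way to handle the collision todo item.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical per-group state: the group key plus the input rows in the group.
struct GroupState {
    key: Vec<Option<i64>>, // `None` models a NULL group value (assumption)
    row_indices: Vec<usize>,
}

/// Hypothetical container: the map holds indexes into `group_states`,
/// not copies of the keys.
#[derive(Default)]
struct Accumulators {
    map: HashMap<u64, Vec<usize>>,
    group_states: Vec<GroupState>,
}

/// Hash a row's group values once, without copying them into a new Vec.
fn hash_row(row: &[Option<i64>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    row.hash(&mut hasher);
    hasher.finish()
}

impl Accumulators {
    /// Assign `row` to a group and return that group's index.
    fn insert_row(&mut self, row_idx: usize, row: &[Option<i64>]) -> usize {
        let hash = hash_row(row);
        let group_states = &mut self.group_states;
        let candidates = self.map.entry(hash).or_default();

        // Several keys may share a hash, so compare the stored key as well.
        let found = candidates
            .iter()
            .copied()
            .find(|&g| group_states[g].key.as_slice() == row);

        let group_idx = match found {
            Some(g) => g,
            None => {
                // First time this key is seen: the key is copied exactly once.
                group_states.push(GroupState {
                    key: row.to_vec(),
                    row_indices: Vec::new(),
                });
                let g = group_states.len() - 1;
                candidates.push(g);
                g
            }
        };

        // Later work uses the index directly; no further hash lookups needed.
        group_states[group_idx].row_indices.push(row_idx);
        group_idx
    }
}
```

Calling `insert_row` for each input row yields a group index per row, so aggregate updates can index straight into `group_states` instead of going back through the hash table.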
   
   # Are there any user-facing changes?
   Faster performance
   
   
   # Notes:
   I tried to keep the same names and structure as the existing hash algorithm 
(I found it easy to follow -- nice work @Dandandan and @andygrove), which I 
think will make this easier to review
   


