[GitHub] [arrow-datafusion] Dandandan commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Wed, 04 Aug 2021 14:22:11 -0700


Dandandan commented on issue #790:
URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892983559



   > @Dandandan if you have a moment, I would like to know if you have any 
concerns with the "change `create_hashes`" function item above, before I spend 
significant time on it
   
   I will try to have a better look at this later. The first feeling I have is 
that the example/proposal is:
   
   * More row-based than `create_hashes` is today. The important part 
of vectorized hashing is that the inner loop runs over a single array 
of a single type, so it does not have to jump between memory locations and 
different code paths to hash each row.
   * Creating and keeping a more complex `GroupKey` per row, which makes 
creating the keys more expensive (one allocation per key, not cache friendly) 
and makes re-hashing a key more expensive (no simple or even identity function 
as hash).
   * Harder to be further vectorized. My belief is that using the Rust HashMap 
is not really the end state of the hash join and hash aggregate, but an easier 
way to implement it.
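   
   To make the first point concrete, here is a rough sketch of the two loop 
shapes. It is not DataFusion's actual code: `Column`, `hash_one`, `combine`, 
`create_hashes_columnar` and `create_hashes_row_wise` are hypothetical names, 
and plain `Vec`s stand in for typed Arrow arrays. Both functions produce the 
same hashes; the only difference is where the type dispatch happens.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a typed Arrow array (not a DataFusion type).
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

/// Hash a single value with the std hasher (illustrative choice only).
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Fold the hash of one more column into a row's running hash.
fn combine(acc: u64, next: u64) -> u64 {
    acc.wrapping_mul(37).wrapping_add(next)
}

/// Column-at-a-time: the type is matched once per column, and the inner
/// loop walks a single array of a single type.
fn create_hashes_columnar(columns: &[Column], num_rows: usize) -> Vec<u64> {
    let mut hashes = vec![0u64; num_rows];
    for column in columns {
        match column {
            Column::Int64(values) => {
                for (hash, value) in hashes.iter_mut().zip(values) {
                    *hash = combine(*hash, hash_one(value));
                }
            }
            Column::Utf8(values) => {
                for (hash, value) in hashes.iter_mut().zip(values) {
                    *hash = combine(*hash, hash_one(value));
                }
            }
        }
    }
    hashes
}

/// Row-at-a-time: the type is matched for every cell, and each row touches
/// every column's memory before moving on to the next row.
fn create_hashes_row_wise(columns: &[Column], num_rows: usize) -> Vec<u64> {
    (0..num_rows)
        .map(|row| {
            let mut hash = 0u64;
            for column in columns {
                let cell = match column {
                    Column::Int64(values) => hash_one(&values[row]),
                    Column::Utf8(values) => hash_one(&values[row]),
                };
                hash = combine(hash, cell);
            }
            hash
        })
        .collect()
}
```

   With the column-wise shape, each inner loop is monomorphic and reads one 
contiguous buffer, which is what leaves room for further vectorization.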
   
   It might still be an improvement over the current state (for the hash 
aggregate), as it looks like it simplifies some parts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
