[GitHub] [arrow-datafusion] Dandandan commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Mon, 02 Aug 2021 11:17:15 -0700


Dandandan commented on issue #790:
URL: 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-891233423



   > > Collisions are handled by the fact that the entry in the hash table is a 
_list_ of indicies into the mutable area -- if there are multiple values in 
that list each entry in the mutable area needs to be checked for equality to 
find the correct one.
   > 
   > I don't really see how we can to this equality checks faster than the 
hashmap itself could do it, but I would be very happy to be proven wrong.
   > 
   > For optimized eq and hash handling, hashbrown has some more low-level 
functionality that we are not yet using:
   > 
   > * [`from_hash(hash, 
eq_fn)`](https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.RawEntryBuilderMut.html#method.from_hash):
 allows looking up an entry with a given hash and equals function, bypassing 
the hashbuilder of the map. This allows implementing eq on values that are not 
directly stored in the map, for example if only an index into another structure 
is part of the key.
   > * [`insert_with_hasher(hash, key, value, 
hasher)`](https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.RawVacantEntryMut.html#method.insert_with_hasher):
 Again allows bypassing the hashbuilder and looking up the hash value in some 
other structure. The `hasher` parameter is only called if the table needs 
rehashing.
   
   `from_hash` might be indeed interesting to explore.
   
   Some more info of why collision might be not that bad, even with an 
alternative approach:
   
   * Based on good quality `u64` hashes, I found the remaining number 
collisions to be very low, as most of the collisions (with different `u64` but 
within the same hashmap "bucket") are already handled in the hashmap 
implementation. I had to implement some tests with fixed hashes to trigger hash 
collisions in the hash join algorithm as the current approach already gave the 
correct answers on all of the unit / integration tests without detecting 
collisions, e.g. each key maps to a different `u64` hash value in the tests.
   * For hash join there are some known ways to do hash collisions in a 
vectorized way e.g. see this article 
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ . this might be 
similar for hash aggregations


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] Dandandan commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

Reply via email to