alamb edited a comment on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892767118
# Summary

TLDR; if it is OK to change `create_hashes` to use `std::hash::Hasher`, so that it is consistent with `ScalarValue::hash`, I believe we can avoid "signatures" and use hashbrown's `HashMap` directly. This will:

1. Handle hash collisions without any DataFusion-specific code
2. Hash key values once for each input row (and once for each new group key)
3. Copy key values once for each group

# Discussion

Based on the great suggestion from @jhorstmann, I spent some time playing with hashbrown's `from_hash`, to try and:

1. Avoid double hashing, as mentioned by @rdettai: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888956349
2. Let hashbrown handle collisions, as mentioned by @jhorstmann: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-890995232
3. Avoid the storage overhead of a synthetic key / signature

The approach was to create a `HashMap` keyed off a list of `ScalarValue`s, similar to the following (we can't actually use `Vec` directly, as explained later):

```
HashMap<Vec<ScalarValue>, GroupByState>
```

and then use the hashbrown raw entry API to avoid creating a `Vec<ScalarValue>` for each row just to look up the correct entry. I made a cut-down example showing how this works here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3c5e4356d83b3a5e2221dc82eb3f4ebf (a hedged, self-contained sketch of the same idea also appears at the end of this comment).

In order to use this approach, the following changes are needed, as shown in the example:

1. Change `create_hashes` to be consistent with the `impl std::hash::Hash` for `ScalarValue` (more details below)
2. Add a function to compare the values of `&[ArrayRef]` at a `row` with the values of a `&[ScalarValue]` without creating `ScalarValue`s
3. Add a wrapper around `[ScalarValue]` so that we can ensure its hash value is consistent with the hash value created from the arrays

The second two are straightforward. The first may require some discussion.

## Consistent Hashing

### Rationale

Sadly, for some reason, even when using the hashbrown `from_hash` API, the hash value of the key is still computed with `std::hash::Hash`. This means the implementation of `create_hashes` must be consistent with `ScalarValue::hash`. You can see the problem by switching the example to `create_hash_existing` (which mirrors how `create_hashes` works today), in which case the matching key is not found in the hash table and the group for `(1, 10)` is not aggregated.

### Changes needed

To illustrate the change that is needed, here is a highly simplified version of how `create_hashes` is implemented today:

```
// Simplified: `Array` (with a plain `values` field) and `combine_hashes`
// stand in for the real Arrow array types and the DataFusion helper
fn create_hash_existing(arrays: &[Array], row: usize, random_state: &RandomState) -> u64 {
    let mut hash: u64 = 0;
    for array in arrays {
        // hash each column's value at `row` independently...
        let row_hash = u64::get_hash(&array.values[row], random_state);
        // ...then fold it into the running hash with a custom combiner
        hash = combine_hashes(hash, row_hash);
    }
    hash
}
```

Here is how a consistent version that uses the `std::hash::Hasher` API is implemented:

```
fn create_hash_working(arrays: &[Array], row: usize, random_state: &RandomState) -> u64 {
    // build a single Hasher and feed every column's value into it, exactly
    // as `ScalarValue::hash` would when hashing the equivalent owned key
    let mut hasher = random_state.build_hasher();
    for array in arrays {
        hasher.write_u64(array.values[row]);
    }
    hasher.finish()
}
```

We also need to sort out how to hash null values, which are handled differently today in `create_hashes` (not hashed) and `ScalarValue` (hashed). @jorgecarleitao has some good thoughts on the topic of hashing nulls here: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731
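To make the null question concrete, here is one possible shape for null-consistent hashing. This is a minimal sketch, assuming we settle on the "nulls are hashed" convention; `hash_nullable_u64` is a hypothetical helper, not an existing DataFusion function:

```
// Sketch: hash an Option<u64> by writing a validity discriminant before the
// payload. If both the array-based path (create_hashes) and the owned-key
// path (ScalarValue::hash) routed through the same helper, nulls would hash
// consistently by construction.
use std::hash::Hasher;

fn hash_nullable_u64<H: Hasher>(value: Option<u64>, hasher: &mut H) {
    match value {
        Some(v) => {
            hasher.write_u8(1); // valid: discriminant, then the value
            hasher.write_u64(v);
        }
        None => hasher.write_u8(0), // every null hashes the same fixed byte
    }
}
```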
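Finally, here is the self-contained sketch promised above, showing the whole scheme end to end with two toy `u64` columns in place of Arrow arrays (so it is an illustration of the approach, not DataFusion code). It assumes hashbrown 0.11, whose raw entry API exposes `raw_entry_mut().from_hash`; `GroupKey`, `hash_row`, and the count-valued map are illustrative stand-ins for `Vec<ScalarValue>` keys and real group state:

```
use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

/// Owned group key; a stand-in for a wrapper around Vec<ScalarValue>.
struct GroupKey(Vec<u64>);

/// Hash the owned key with exactly the same Hasher calls as `hash_row`
/// below, so the hash hashbrown computes via `std::hash::Hash` (on insert
/// and on table growth) matches the hash computed from the input columns.
impl Hash for GroupKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        for v in &self.0 {
            state.write_u64(*v);
        }
    }
}

/// Hash row `row` of `columns` once, without materializing an owned key.
fn hash_row<S: BuildHasher>(columns: &[Vec<u64>], row: usize, s: &S) -> u64 {
    let mut hasher = s.build_hasher();
    for col in columns {
        hasher.write_u64(col[row]);
    }
    hasher.finish()
}

fn main() {
    // two "columns", three rows: (1, 10), (2, 20), (1, 10)
    let columns: Vec<Vec<u64>> = vec![vec![1, 2, 1], vec![10, 20, 10]];
    let mut groups: HashMap<GroupKey, u64> = HashMap::new();

    for row in 0..3 {
        let hash = hash_row(&columns, row, groups.hasher());
        // compare candidate keys against the row values in place; no
        // Vec is allocated unless the key turns out to be new
        let entry = groups.raw_entry_mut().from_hash(hash, |key| {
            key.0.iter().zip(columns.iter()).all(|(k, col)| *k == col[row])
        });
        match entry {
            RawEntryMut::Occupied(mut o) => *o.get_mut() += 1,
            RawEntryMut::Vacant(v) => {
                // only now copy the key values (once per distinct group)
                let key = GroupKey(columns.iter().map(|c| c[row]).collect());
                v.insert(key, 1);
            }
        }
    }

    assert_eq!(groups.len(), 2); // rows 0 and 2 collapse into group (1, 10)
}
```

Note that the vacant entry's `insert` recomputes the key's hash via `std::hash::Hash`, which is exactly why the manual `Hash` impl on `GroupKey` must mirror `hash_row`; this is the same consistency requirement described above for `create_hashes` and `ScalarValue::hash`.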