alamb edited a comment on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892767118
# Summary

TLDR; if it is OK to change `create_hashes` to use `std::hash::Hasher`, so that it is consistent with `ScalarValue::hash`, I believe we can avoid "signatures" and use hashbrown's `HashMap` directly. This will:

1. Handle hash collisions without any DataFusion-specific code
2. Hash key values once for each input row (and once for each new group key)
3. Copy key values once for each group

# Discussion

Based on the great suggestion from @jhorstmann, I spent some time playing with hashbrown's `from_hash`, to try and:

1. Avoid double hashing, as mentioned by @rdettai: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888956349
2. Let hashbrown handle collisions, as mentioned by @jhorstmann: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-890995232
3. Avoid the storage overhead of a synthetic key / signature

The approach was to create a `HashMap` keyed off a list of `ScalarValue`s, similar to the following (we can't actually use `Vec` directly, as explained later):

```
HashMap<Vec<ScalarValue>, GroupByState>
```

and then use the hashbrown raw entry API to avoid creating a `Vec<ScalarValue>` for each row just to look up the correct entry. I made a cut-down example showing how this works here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3c5e4356d83b3a5e2221dc82eb3f4ebf (a hedged, self-contained sketch of the same idea also appears at the end of this comment).

In order to use this approach, the following changes are needed, as shown in the example:

1. Change `create_hashes` to be consistent with the `impl std::hash::Hash` for `ScalarValue` (more details below)
2. Add a function to compare the values of `&[ArrayRef]` at a `row` with the values of a `&[ScalarValue]` without creating `ScalarValue`s
3. Add a wrapper around `[ScalarValue]` so that we can ensure its hash value is consistent with the hash value created from the arrays

The second two are straightforward. The first may require some discussion.

## Consistent Hashing

### Rationale

Sadly, for some reason, even when using the hashbrown `from_hash` API, the hash value of the key is still computed with `std::hash::Hash`. This means the implementation of `create_hashes` must be consistent with `ScalarValue::hash`. You can see the problem by switching the example to `create_hash_existing` (which mirrors how `create_hashes` works today), in which case the matching key is not found in the hash table and the group for `(1, 10)` is not aggregated.

### Changes needed

To illustrate the change that is needed, here is a highly simplified version of how `create_hashes` is implemented today:

```
// Simplified: `Array` (with a plain `values` field) and `combine_hashes`
// stand in for the real Arrow array types and the DataFusion helper
fn create_hash_existing(arrays: &[Array], row: usize, random_state: &RandomState) -> u64 {
    let mut hash: u64 = 0;
    for array in arrays {
        // hash each column's value at `row` independently...
        let row_hash = u64::get_hash(&array.values[row], random_state);
        // ...then fold it into the running hash with a custom combiner
        hash = combine_hashes(hash, row_hash);
    }
    hash
}
```

Here is how a consistent version that uses the `std::hash::Hasher` API is implemented:

```
fn create_hash_working(arrays: &[Array], row: usize, random_state: &RandomState) -> u64 {
    // build a single Hasher and feed every column's value into it, exactly
    // as `ScalarValue::hash` would when hashing the equivalent owned key
    let mut hasher = random_state.build_hasher();
    for array in arrays {
        hasher.write_u64(array.values[row]);
    }
    hasher.finish()
}
```

We also need to sort out how to hash null values, which are handled differently today in `create_hashes` (not hashed) and `ScalarValue` (hashed). @jorgecarleitao has some good thoughts on the topic of hashing nulls here: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731
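To make the null question concrete, here is one possible shape for null-consistent hashing. This is a minimal sketch, assuming we settle on the "nulls are hashed" convention; `hash_nullable_u64` is a hypothetical helper, not an existing DataFusion function:

```
// Sketch: hash an Option<u64> by writing a validity discriminant before the
// payload. If both the array-based path (create_hashes) and the owned-key
// path (ScalarValue::hash) routed through the same helper, nulls would hash
// consistently by construction.
use std::hash::Hasher;

fn hash_nullable_u64<H: Hasher>(value: Option<u64>, hasher: &mut H) {
    match value {
        Some(v) => {
            hasher.write_u8(1); // valid: discriminant, then the value
            hasher.write_u64(v);
        }
        None => hasher.write_u8(0), // every null hashes the same fixed byte
    }
}
```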
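Finally, here is the self-contained sketch promised above, showing the whole scheme end to end with two toy `u64` columns in place of Arrow arrays (so it is an illustration of the approach, not DataFusion code). It assumes hashbrown 0.11, whose raw entry API exposes `raw_entry_mut().from_hash`; `GroupKey`, `hash_row`, and the count-valued map are illustrative stand-ins for `Vec<ScalarValue>` keys and real group state:

```
use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

/// Owned group key; a stand-in for a wrapper around Vec<ScalarValue>.
struct GroupKey(Vec<u64>);

/// Hash the owned key with exactly the same Hasher calls as `hash_row`
/// below, so the hash hashbrown computes via `std::hash::Hash` (on insert
/// and on table growth) matches the hash computed from the input columns.
impl Hash for GroupKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        for v in &self.0 {
            state.write_u64(*v);
        }
    }
}

/// Hash row `row` of `columns` once, without materializing an owned key.
fn hash_row<S: BuildHasher>(columns: &[Vec<u64>], row: usize, s: &S) -> u64 {
    let mut hasher = s.build_hasher();
    for col in columns {
        hasher.write_u64(col[row]);
    }
    hasher.finish()
}

fn main() {
    // two "columns", three rows: (1, 10), (2, 20), (1, 10)
    let columns: Vec<Vec<u64>> = vec![vec![1, 2, 1], vec![10, 20, 10]];
    let mut groups: HashMap<GroupKey, u64> = HashMap::new();

    for row in 0..3 {
        let hash = hash_row(&columns, row, groups.hasher());
        // compare candidate keys against the row values in place; no
        // Vec is allocated unless the key turns out to be new
        let entry = groups.raw_entry_mut().from_hash(hash, |key| {
            key.0.iter().zip(columns.iter()).all(|(k, col)| *k == col[row])
        });
        match entry {
            RawEntryMut::Occupied(mut o) => *o.get_mut() += 1,
            RawEntryMut::Vacant(v) => {
                // only now copy the key values (once per distinct group)
                let key = GroupKey(columns.iter().map(|c| c[row]).collect());
                v.insert(key, 1);
            }
        }
    }

    assert_eq!(groups.len(), 2); // rows 0 and 2 collapse into group (1, 10)
}
```

Note that the vacant entry's `insert` recomputes the key's hash via `std::hash::Hash`, which is exactly why the manual `Hash` impl on `GroupKey` must mirror `hash_row`; this is the same consistency requirement described above for `create_hashes` and `ScalarValue::hash`.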