[GitHub] [arrow-datafusion] jhorstmann commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

GitBox Wed, 04 Aug 2021 09:58:25 -0700


jhorstmann commented on issue #790:
URL: 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892819030



   Thanks, great writeup and nice example. Consistent hashes between Arrays and 
Scalar sound very nice but I think will require an extensive test suite for all 
data types so it doesn't get broken accidentally.
   
   An alternative could be to store the calculated hash also in the `GroupKey`, 
but that would increase the size of each key and decrease cache locality of the 
map. Rehashing would become faster, but that operation would ideally be quite 
rare.
   
   For reference, my experiments with grouping by mapping each input row to a 
consecutive integer used hashbrown like this:
   
   ```
   fn hash_to_group_indices(
       columns: &[Column],
       len: usize,
       group_by: &[usize],
   ) -> (Vec<usize>, Vec<usize>) {
       let mut hashes: Vec<u64> = vec![0; len];
       hash(columns, group_by, hashes.as_mut_slice());
   
       // row -> group idx
       let mut map: HashMap<usize, usize> = 
HashMap::with_capacity(DEFAULT_MAP_CAPACITY);
   
       // index of the group that each row belongs to
       let indices: Vec<usize> = hashes.iter().enumerate().map(|(i,  hash)| {
           let next = map.len();
   
           match map.raw_entry_mut().from_hash(*hash, |j| columns_eq(columns, 
group_by, i, *j)) {
               RawEntryMut::Occupied(entry) => *entry.get(),
               RawEntryMut::Vacant(entry) => {
                   *entry.insert_with_hasher(*hash, i, next, |j| hashes[*j]).1
               }
           }
       }).collect();
   
       // index where each group first appeared in the input 
       let mut key_indices = map.keys().copied().collect::<Vec<usize>>();
       key_indices.sort_unstable();
   
       (indices, key_indices)
   }
   ```
   
   Since the hashmap key is just an index into the input columns, the 
`insert_with_hasher` can look up the hash value based on that index. The hash 
value for the new entry is provided as a parameter, the hasher closure is only 
used for rehashing. The example requires access to the whole data though, this 
unfortunately won't work when processing a batch at a time.
   
   There is also a method `insert_hashed_no_check`, which allows specifying the 
hash value for the new entry, but would still use the default hash 
implementation for rehashing. That would remove one hash calculation in your 
code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] jhorstmann commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

Reply via email to