jhorstmann commented on issue #790:
URL:
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892819030
Thanks, great writeup and nice example. Consistent hashes between Arrays and
Scalar sound very nice but I think will require an extensive test suite for all
data types so it doesn't get broken accidentally.
An alternative could be to store the calculated hash also in the `GroupKey`,
but that would increase the size of each key and decrease cache locality of the
map. Rehashing would become faster, but that operation would ideally be quite
rare.
For reference, my experiments with grouping by mapping each input row to a
consecutive integer used hashbrown like this:
```
fn hash_to_group_indices(
columns: &[Column],
len: usize,
group_by: &[usize],
) -> (Vec<usize>, Vec<usize>) {
let mut hashes: Vec<u64> = vec![0; len];
hash(columns, group_by, hashes.as_mut_slice());
// row -> group idx
let mut map: HashMap<usize, usize> =
HashMap::with_capacity(DEFAULT_MAP_CAPACITY);
// index of the group that each row belongs to
let indices: Vec<usize> = hashes.iter().enumerate().map(|(i, hash)| {
let next = map.len();
match map.raw_entry_mut().from_hash(*hash, |j| columns_eq(columns,
group_by, i, *j)) {
RawEntryMut::Occupied(entry) => *entry.get(),
RawEntryMut::Vacant(entry) => {
*entry.insert_with_hasher(*hash, i, next, |j| hashes[*j]).1
}
}
}).collect();
// index where each group first appeared in the input
let mut key_indices = map.keys().copied().collect::<Vec<usize>>();
key_indices.sort_unstable();
(indices, key_indices)
}
```
Since the hashmap key is just an index into the input columns, the
`insert_with_hasher` can look up the hash value based on that index. The hash
value for the new entry is provided as a parameter, the hasher closure is only
used for rehashing. The example requires access to the whole data though, this
unfortunately won't work when processing a batch at a time.
There is also a method `insert_hashed_no_check`, which allows specifying the
hash value for the new entry, but would still use the default hash
implementation for rehashing. That would remove one hash calculation in your
code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]