Dandandan commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888229063
> > we can keep the keys in a contiguous (mutable) array instead and keep
offsets/pointers to the values in this array (and null values can be stored in
a bitmap, so only 1 bit per value).
>
> @Dandandan this is a great idea - I will write up in more detail what I
think you are proposing to make sure we are on the same page
Some more details to get you started, this is the rough idea that is a
similar to how hash join currently works (there it's a bit easier as we can
concatenate the data upfront):
* We can store both group by keys and aggregate values in typed / mutable
Arrays. This means the memory overhead is kept to a minimum and is much more
vectorization / cache friendly.
E.g. for a count grouped on u64 values, the state we keep for the group by
values is `MutableArray<U64Type>` and for the count state another
`MutableArray<U64Type>` (no bitmap needed here). The group by values can have
null values which are stored in a bitmap.
* A hashmap that only contains hashes / pointers: `Hashmap<u64,
SmallVec<u64>`
- first `u64` is the hash of the group by values -> this can be
calculated in a vectorized manner for the entire batch
- second `u64` is the offset in both the group by value(s) and aggregate
state arrays -> inserted to the mutable arrays when there is no match
* hash collisions (group by values mapping to same `u64` should/can be
handled by comparing values in the arrays at the group by offset and scanning
the `SmallVec`
Above structure will reduce the memory needed for the state, and will also
reduce the time to create / (re)hash the keys and will reduce the amount of
intermediate vec / small allocations that are needed in the hash aggregate code.
There are maybe slightly different ways to encode the data in the hashmap /
check collisions, and above structure makes some further optimizations /
vectorization possible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]