Dandandan commented on pull request #786: URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888229063
> > we can keep the keys in a contiguous (mutable) array instead and keep offsets/pointers to the values in this array (and null values can be stored in a bitmap, so only 1 bit per value). > > @Dandandan this is a great idea - I will write up in more detail what I think you are proposing to make sure we are on the same page Some more details to get you started, this is the rough idea that is a similar to how hash join currently works (there it's a bit easier as we can concatenate the data upfront): * We can store both group by keys and aggregate values in typed / mutable Arrays. This means the memory overhead is kept to a minimum and is much more vectorization / cache friendly. E.g. for a count grouped on u64 values, the state we keep for the group by values is `MutableArray<U64Type>` and for the count state another `MutableArray<U64Type>` (no bitmap needed here). The group by values can have null values which are stored in a bitmap. * A hashmap that only contains hashes / pointers: `Hashmap<u64, SmallVec<u64>` - first `u64` is the hash of the group by values -> this can be calculated in a vectorized manner for the entire batch - second `u64` is the offset in both the group by value(s) and aggregate state arrays -> inserted to the mutable arrays when there is no match * hash collisions (group by values mapping to same `u64` should/can be handled by comparing values in the arrays at the group by offset and scanning the `SmallVec` Above structure will reduce the memory needed for the state, and will also reduce the time to create / (re)hash the keys and will reduce the amount of intermediate vec / small allocations that are needed in the hash aggregate code. There are maybe slightly different ways to encode the data in the hashmap / check collisions, and above structure makes some further optimizations / vectorization possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org