jhorstmann commented on issue #418:
URL:
https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851422139
> I think what would be best in the long run is building a mutable typed
> array for the aggregation states, and keeping only the _offsets_ to that
> array in a hash table structure.
I agree, and I also ran some experiments around this over the last few weeks.
The slowest part is unfortunately still that you need a hashmap to build this
array of consecutive offsets. But that part is much easier to specialize for
different key types or numbers of group-by columns. Updating the accumulator
states as in your code then trades dynamic method calls for random memory
access and should be faster.
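A minimal Rust sketch of this idea, assuming a single `u32` key column and a `u64` sum aggregate (the `group_sum` helper and the concrete types are illustrative, not datafusion code): the hash map stores only each group's offset, while the accumulator state lives in a flat, typed `Vec` that is updated by plain indexing.

```rust
use std::collections::HashMap;

/// Groups `values` by `keys`, keeping only offsets in the hash map.
/// The per-group state (`sums`) is a flat, typed Vec updated by index,
/// so no dynamic dispatch is needed in the hot loop.
fn group_sum(keys: &[u32], values: &[u64]) -> (Vec<u32>, Vec<u64>) {
    let mut offsets: HashMap<u32, usize> = HashMap::new();
    let mut group_keys: Vec<u32> = Vec::new();
    let mut sums: Vec<u64> = Vec::new(); // typed accumulator states

    for (&key, &value) in keys.iter().zip(values) {
        let next = group_keys.len();
        let offset = *offsets.entry(key).or_insert_with(|| {
            // first time we see this key: append a fresh state
            group_keys.push(key);
            sums.push(0);
            next
        });
        sums[offset] += value; // plain indexed access into the typed array
    }
    (group_keys, sums)
}
```

The hash map is still on the hot path here, but it holds only a `usize` per group, and the state update itself is a tight, monomorphized loop.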
I did not base those experiments on arrow or datafusion, though. The approach
in datafusion of collecting the row indices and using `take` + `batch_update`
could be better for small cardinalities but might otherwise have bigger
overhead. The two approaches I compared were
- a simple `HashMap<Vec<Scalar>, Vec<Box<dyn Accumulator>>>` that gets
updated row by row and afterwards iterated for each group and aggregate column
- a `HashMap<Vec<Scalar>, usize>`, the value being a sequential offset, while
at the same time collecting the keys into a `Vec<Vec<Scalar>>` that corresponds
to those offsets. Afterwards this does basically the same as in your example
code, only requiring a dispatch on the array type once.
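The second approach can be sketched in two phases, assuming two `u32` key columns as in the benchmark below (the `group_offsets` and `sum_by_offset` helpers are hypothetical names, not from the experiments): first map each key to a sequential offset and record a per-row offset, then update each aggregate column in a single typed pass, so the dispatch on the array type happens once per column rather than once per row.

```rust
use std::collections::HashMap;

/// Phase 1: assign each distinct (a, b) key a sequential offset and
/// record the offset for every row.
fn group_offsets(col_a: &[u32], col_b: &[u32]) -> (Vec<(u32, u32)>, Vec<usize>) {
    let mut map: HashMap<(u32, u32), usize> = HashMap::new();
    let mut keys: Vec<(u32, u32)> = Vec::new();
    let mut row_offsets = Vec::with_capacity(col_a.len());
    for (&a, &b) in col_a.iter().zip(col_b) {
        let next = keys.len();
        let offset = *map.entry((a, b)).or_insert_with(|| {
            keys.push((a, b));
            next
        });
        row_offsets.push(offset);
    }
    (keys, row_offsets)
}

/// Phase 2: one dispatch on the value type, then a tight typed loop
/// over the precomputed row offsets.
fn sum_by_offset(values: &[u64], row_offsets: &[usize], num_groups: usize) -> Vec<u64> {
    let mut sums = vec![0u64; num_groups];
    for (&v, &o) in values.iter().zip(row_offsets) {
        sums[o] += v;
    }
    sums
}
```

Because phase 2 never touches the hash map or any `Box<dyn Accumulator>`, it can be specialized per value type, which is where the speedup in the comparison below would come from.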
In a microbenchmark with random data and grouping by two u32 columns (about
1000 different groups), ignoring any nulls, the second one was about 2x faster.