jhorstmann commented on issue #418:
URL:
https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851422139
> I think what would be best in the long run is building a mutable typed
> array for the aggregation states, and keeping only the _offsets_ to that
> array in a hash table structure.
I agree, and I also ran some experiments around this over the last few weeks.
The slowest part is unfortunately still that you need a hashmap to build this
array of consecutive offsets. But that part is much easier to specialize for
different key types or numbers of group-by columns. Updating the accumulator
states as in your code then trades dynamic method calls for random memory
access and should be faster.
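A minimal Rust sketch of this idea, assuming a single `u32` key column and a `u64` sum aggregate (the `group_sum` helper and the concrete types are illustrative, not datafusion code): the hash map stores only each group's offset, while the accumulator state lives in a flat, typed `Vec` that is updated by plain indexing.

```rust
use std::collections::HashMap;

/// Groups `values` by `keys`, keeping only offsets in the hash map.
/// The per-group state (`sums`) is a flat, typed Vec updated by index,
/// so no dynamic dispatch is needed in the hot loop.
fn group_sum(keys: &[u32], values: &[u64]) -> (Vec<u32>, Vec<u64>) {
    let mut offsets: HashMap<u32, usize> = HashMap::new();
    let mut group_keys: Vec<u32> = Vec::new();
    let mut sums: Vec<u64> = Vec::new(); // typed accumulator states

    for (&key, &value) in keys.iter().zip(values) {
        let next = group_keys.len();
        let offset = *offsets.entry(key).or_insert_with(|| {
            // first time we see this key: append a fresh state
            group_keys.push(key);
            sums.push(0);
            next
        });
        sums[offset] += value; // plain indexed access into the typed array
    }
    (group_keys, sums)
}
```

The hash map is still on the hot path here, but it holds only a `usize` per group, and the state update itself is a tight, monomorphized loop.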
I did not base those experiments on arrow or datafusion, though. The approach
in datafusion of collecting the row indices and using `take` + `batch_update`
could be better for small cardinalities but might otherwise have bigger
overhead. The two approaches I compared were
- a simple `HashMap<Vec<Scalar>, Vec<Box<dyn Accumulator>>>` that gets
updated row by row and afterwards iterated for each group and aggregate column
- a `HashMap<Vec<Scalar>, usize>`, the value being a sequential offset, while
at the same time collecting the keys into a `Vec<Vec<Scalar>>` that corresponds
to those offsets. Afterwards this does basically the same as in your example
code, only requiring a dispatch on the array type once.
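The second approach can be sketched in two phases, assuming two `u32` key columns as in the benchmark below (the `group_offsets` and `sum_by_offset` helpers are hypothetical names, not from the experiments): first map each key to a sequential offset and record a per-row offset, then update each aggregate column in a single typed pass, so the dispatch on the array type happens once per column rather than once per row.

```rust
use std::collections::HashMap;

/// Phase 1: assign each distinct (a, b) key a sequential offset and
/// record the offset for every row.
fn group_offsets(col_a: &[u32], col_b: &[u32]) -> (Vec<(u32, u32)>, Vec<usize>) {
    let mut map: HashMap<(u32, u32), usize> = HashMap::new();
    let mut keys: Vec<(u32, u32)> = Vec::new();
    let mut row_offsets = Vec::with_capacity(col_a.len());
    for (&a, &b) in col_a.iter().zip(col_b) {
        let next = keys.len();
        let offset = *map.entry((a, b)).or_insert_with(|| {
            keys.push((a, b));
            next
        });
        row_offsets.push(offset);
    }
    (keys, row_offsets)
}

/// Phase 2: one dispatch on the value type, then a tight typed loop
/// over the precomputed row offsets.
fn sum_by_offset(values: &[u64], row_offsets: &[usize], num_groups: usize) -> Vec<u64> {
    let mut sums = vec![0u64; num_groups];
    for (&v, &o) in values.iter().zip(row_offsets) {
        sums[o] += v;
    }
    sums
}
```

Because phase 2 never touches the hash map or any `Box<dyn Accumulator>`, it can be specialized per value type, which is where the speedup in the comparison below would come from.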
In a microbenchmark with random data and grouping by two u32 columns (about
1000 different groups), ignoring any nulls, the second one was about 2x faster.