jhorstmann commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888591358
One thing I'd like to benchmark is storing the indices as part of the map values. I see that this can make updating the accumulators more efficient, since we can use `update_batch` instead of `update(ScalarValue)`, but collecting those indices and using the `take` kernel and `array.slice` also adds overhead, especially with a larger number of groups. So I'm wondering whether directly updating the accumulators row by row would be much simpler and also not much slower. The overhead in that approach would be in constructing and matching `ScalarValue`. All accumulators except DistinctCount only work with numbers as input, so in most cases that wouldn't require allocation. Overhead could maybe be reduced further by creating typed accumulators (via a macro): instead of a SumAccumulator that has to match on all possible variants, there could be separate Float64Sum, Int64Sum, UInt64Sum and so on. Not sure how much benefit that would bring, though.

I have been experimenting with a completely different approach for some time, although with a much simplified type system and without considering null values yet. It would also require access to the whole input data and might be difficult to adapt to work with batches. The main idea is to use a hashmap that only maps input row numbers to consecutive integers identifying which group each row belongs to. So for an input like `[a, b, b, c, b, a]` these indices would be `[0, 1, 1, 2, 1, 0]`. We also keep track of the index where each group first appeared in the input, `[0, 1, 3]` in this example, which can then be used to create the arrays for the group-by columns. In this example there is only one group-by column, but the approach also works with multiple. Ideally the hash values for the group-by arrays would be calculated up front in a vectorized way; the biggest remaining overhead is then the `equals` function, which has to look up two indices in the group-by arrays and compare their values.
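To make the index-mapping idea concrete, here is a minimal sketch using plain strings as stand-ins for the group-by values; a real implementation would hash the group-by arrays up front and compare rows by index rather than materializing keys like this:

```rust
use std::collections::HashMap;

/// Map each input row to a dense group index and record the row
/// where each group first appeared. Illustrative only: `&str` keys
/// stand in for arbitrary group-by columns.
fn group_indices<'a>(input: &[&'a str]) -> (Vec<usize>, Vec<usize>) {
    let mut groups: HashMap<&'a str, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(input.len());
    let mut first_seen = Vec::new();
    for (row, &key) in input.iter().enumerate() {
        let next = groups.len();
        let group = *groups.entry(key).or_insert_with(|| {
            // First time we see this key: remember where it appeared.
            first_seen.push(row);
            next
        });
        indices.push(group);
    }
    (indices, first_seen)
}

fn main() {
    let (indices, first_seen) = group_indices(&["a", "b", "b", "c", "b", "a"]);
    assert_eq!(indices, vec![0, 1, 1, 2, 1, 0]);
    assert_eq!(first_seen, vec![0, 1, 3]);
    println!("indices: {:?}, first seen: {:?}", indices, first_seen);
}
```

The `first_seen` positions are exactly what is needed to build the output arrays for the group-by columns, e.g. with a `take` on the original input.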
Once the indices are calculated as described above, we can do the aggregation. For each aggregate there is a generically typed implementation that takes the input array for that aggregation and the indices as parameters. It creates a typed `Vec` of accumulators, with length equal to the number of groups, and then iterates through the indices and inputs, updating the accumulator at each index. This does a lot of random access, but no runtime type checks or dynamic dispatch are needed inside the inner loop. Another benefit of the approach is that there can be separate optimized versions for the indices part, for example when grouping by a single dictionary-encoded array.
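A rough sketch of that typed aggregation loop, using sum as the aggregate (names are illustrative, not the actual DataFusion API; the point is that the inner loop is monomorphized, so there are no type checks or dynamic dispatch per row):

```rust
/// Sum `values` into one accumulator slot per group, given the
/// precomputed group index for each row. Illustrative sketch of the
/// typed-accumulator idea, not an actual DataFusion interface.
fn sum_by_group<T: Copy + std::ops::AddAssign + Default>(
    values: &[T],
    indices: &[usize],
    num_groups: usize,
) -> Vec<T> {
    let mut accumulators = vec![T::default(); num_groups];
    for (&value, &group) in values.iter().zip(indices) {
        // Random access into the accumulator vec, but the element
        // type is concrete here, so the loop stays branch-light.
        accumulators[group] += value;
    }
    accumulators
}

fn main() {
    // Group indices from the example input [a, b, b, c, b, a].
    let indices = [0, 1, 1, 2, 1, 0];
    let values = [1.0_f64, 2.0, 3.0, 4.0, 5.0, 6.0];
    let sums = sum_by_group(&values, &indices, 3);
    assert_eq!(sums, vec![7.0, 10.0, 4.0]);
    println!("{:?}", sums);
}
```

The same shape works for min, max, or count by swapping the `+=` update and the initial value.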