jhorstmann commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888591358
One thing I'd like to benchmark is storing the indices as part of the map values. I see that this can make updating the accumulators more efficient, since we can use `update_batch` instead of `update(ScalarValue)`, but collecting those indices and using the `take` kernel and `array.slice` also adds overhead, especially with a larger number of groups. So I'm wondering whether directly updating the accumulators row by row would be much simpler and also not much slower. The overhead in that approach would be in constructing and matching `ScalarValue`. All accumulators except DistinctCount only work with numbers as input, so in most cases that wouldn't require allocation. Overhead could maybe be reduced further by creating typed accumulators (via a macro): instead of a SumAccumulator that has to match on all possible variants, there could be separate Float64Sum, Int64Sum, UInt64Sum and so on. Not sure how much benefit that would bring, though.

I have been experimenting with a completely different approach for some time, although with a much simplified type system and without considering null values yet. It would also require access to the whole input data and might be difficult to adapt to work with batches. The main idea is to use a hashmap that only maps input row numbers to consecutive integers identifying which group each row belongs to. So for an input like `[a, b, b, c, b, a]` these indices would be `[0, 1, 1, 2, 1, 0]`. We also keep track of the index where each group first appeared in the input, `[0, 1, 3]` in this example, which can then be used to create the arrays for the group-by columns. In this example there is only one group-by column, but the approach also works with multiple. Ideally the hash values for the group-by arrays would be calculated up front in a vectorized way; the biggest remaining overhead is then the `equals` function, which has to look up two indices in the group-by arrays and compare their values.
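To make the index-mapping idea concrete, here is a minimal sketch using plain strings as stand-ins for the group-by values; a real implementation would hash the group-by arrays up front and compare rows by index rather than materializing keys like this:

```rust
use std::collections::HashMap;

/// Map each input row to a dense group index and record the row
/// where each group first appeared. Illustrative only: `&str` keys
/// stand in for arbitrary group-by columns.
fn group_indices<'a>(input: &[&'a str]) -> (Vec<usize>, Vec<usize>) {
    let mut groups: HashMap<&'a str, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(input.len());
    let mut first_seen = Vec::new();
    for (row, &key) in input.iter().enumerate() {
        let next = groups.len();
        let group = *groups.entry(key).or_insert_with(|| {
            // First time we see this key: remember where it appeared.
            first_seen.push(row);
            next
        });
        indices.push(group);
    }
    (indices, first_seen)
}

fn main() {
    let (indices, first_seen) = group_indices(&["a", "b", "b", "c", "b", "a"]);
    assert_eq!(indices, vec![0, 1, 1, 2, 1, 0]);
    assert_eq!(first_seen, vec![0, 1, 3]);
    println!("indices: {:?}, first seen: {:?}", indices, first_seen);
}
```

The `first_seen` positions are exactly what is needed to build the output arrays for the group-by columns, e.g. with a `take` on the original input.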
Once the indices are calculated as described above, we can do the aggregation. For each aggregate there is a generically typed implementation that takes the input array for that aggregation and the indices as parameters. It creates a typed `Vec` of accumulators, with length equal to the number of groups, and then iterates through the indices and inputs, updating the accumulator at each index. This does a lot of random access, but no runtime type checks or dynamic dispatch are needed inside the inner loop. Another benefit of the approach is that there can be separate optimized versions for the indices part, for example when grouping by a single dictionary-encoded array.
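A rough sketch of that typed aggregation loop, using sum as the aggregate (names are illustrative, not the actual DataFusion API; the point is that the inner loop is monomorphized, so there are no type checks or dynamic dispatch per row):

```rust
/// Sum `values` into one accumulator slot per group, given the
/// precomputed group index for each row. Illustrative sketch of the
/// typed-accumulator idea, not an actual DataFusion interface.
fn sum_by_group<T: Copy + std::ops::AddAssign + Default>(
    values: &[T],
    indices: &[usize],
    num_groups: usize,
) -> Vec<T> {
    let mut accumulators = vec![T::default(); num_groups];
    for (&value, &group) in values.iter().zip(indices) {
        // Random access into the accumulator vec, but the element
        // type is concrete here, so the loop stays branch-light.
        accumulators[group] += value;
    }
    accumulators
}

fn main() {
    // Group indices from the example input [a, b, b, c, b, a].
    let indices = [0, 1, 1, 2, 1, 0];
    let values = [1.0_f64, 2.0, 3.0, 4.0, 5.0, 6.0];
    let sums = sum_by_group(&values, &indices, 3);
    assert_eq!(sums, vec![7.0, 10.0, 4.0]);
    println!("{:?}", sums);
}
```

The same shape works for min, max, or count by swapping the `+=` update and the initial value.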