Thanks Jorge, Wes, Will study the links and try to propose improvements of c++ aggregate function.
On 9/16/20 11:17 PM, Wes McKinney wrote:
Perhaps it would be helpful to look at how Clickhouse's aggregate functions are implemented? https://github.com/ClickHouse/ClickHouse/tree/master/src/AggregateFunctions You're welcome to propose improvements to the kernel interface to accommodate more complex aggregate functions On Wed, Sep 16, 2020 at 5:27 AM Jorge Cardoso Leitão <[email protected]> wrote:Hi Yibo, That is correct. The simplest example is an average of 3 elements {x1,x2,x3}, in two chunks: {x1} and {x2,x3}. The average of the average is not equal to the average: avg({avg({x1}), avg({x2,x3})}) = ((x1) + (x2 + x3)/2)/2 != (x1 + x2 + x3) / 3 = avg({x1,x2,x3}) We are solving this in DataFusion (Rust): Issue: https://issues.apache.org/jira/browse/ARROW-9937 Proposal: https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit PR https://github.com/apache/arrow/pull/8172, Essentially, what you said, there needs to be two operations, `update` and `merge`, which are not the same. There are many other examples, and that is also why e.g. spark's UDAF's interface requires `update` and `merge` On Wed, Sep 16, 2020 at 12:12 PM Yibo Cai <[email protected]> wrote:Hi, I have a question about aggregate kernel implementation. Any help is appreciated. Aggregate kernel implements "consume" and "merge" interfaces. For a chunked array, "consume" is called for each array to get a temporary aggregated result, then "merge" it with previously consumed result. For associative operations like min/max/sum, this pattern is convenient. We can easily "merge" min/max/sum of two arrays, e.g, sum([array_a, array_b]) = sum(array_a) + sum(array_b). But I wonder what's the best approach to deal with operations like stdev/percentile. Results of these operations cannot be easily "merged". We have to walk through all the chunks to get the result. For these operations, looks "consume" must copy the input array and do all calculation once at "finalize" time. Or we don't expect it to support chunked array for them. Yibo
