Hi Yibo,

That is correct. The simplest example is an average of 3 elements
{x1, x2, x3}, in two chunks: {x1} and {x2, x3}. The average of the
averages is not equal to the average:
avg({avg({x1}), avg({x2, x3})}) = (x1 + (x2 + x3)/2) / 2
                               != (x1 + x2 + x3) / 3 = avg({x1, x2, x3})

We are solving this in DataFusion (Rust):

Issue: https://issues.apache.org/jira/browse/ARROW-9937
Proposal: https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit
PR: https://github.com/apache/arrow/pull/8172

Essentially, as you said, there need to be two operations, `update` and
`merge`, and they are not the same. The trick is to keep an intermediate
state that merges exactly (for avg, the pair (sum, count)); there is a
sketch below the quoted message illustrating this. There are many other
examples, and that is also why e.g. Spark's UDAF interface requires both
`update` and `merge`.

On Wed, Sep 16, 2020 at 12:12 PM Yibo Cai <yibo....@arm.com> wrote:

> Hi,
>
> I have a question about aggregate kernel implementation. Any help is
> appreciated.
>
> The aggregate kernel implements "consume" and "merge" interfaces. For a
> chunked array, "consume" is called for each array to get a temporary
> aggregated result, which is then "merged" with the previously consumed
> result. For associative operations like min/max/sum, this pattern is
> convenient: we can easily "merge" the min/max/sum of two arrays, e.g.,
> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
>
> But I wonder what's the best approach for operations like
> stdev/percentile. Results of these operations cannot be easily "merged";
> we have to walk through all the chunks to get the result. For these
> operations, it looks like "consume" must copy the input array and do all
> the calculation at "finalize" time. Or perhaps we shouldn't expect
> chunked array support for them.
>
> Yibo
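For concreteness, here is a minimal Rust sketch of the two-operation
pattern described above (the `AvgState` name and its methods are
illustrative only, not DataFusion's actual accumulator API). The key is
that the intermediate state is the pair (sum, count), which merges
exactly, rather than the average itself, which does not:

#[derive(Debug, Default, Clone, Copy)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    // `update`: consume one chunk of raw values.
    fn update(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
        self.count += values.len() as u64;
    }

    // `merge`: combine with the state produced from another chunk.
    // Unlike averaging two averages, this is exact.
    fn merge(&mut self, other: &AvgState) {
        self.sum += other.sum;
        self.count += other.count;
    }

    // `finalize`: produce the final average from the merged state.
    fn finalize(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum / self.count as f64)
    }
}

fn main() {
    // The counterexample from above: {x1} in one chunk, {x2, x3} in another.
    let (x1, x2, x3) = (1.0, 2.0, 3.0);

    let mut a = AvgState::default();
    a.update(&[x1]);

    let mut b = AvgState::default();
    b.update(&[x2, x3]);

    a.merge(&b);
    // (1 + 2 + 3) / 3 = 2.0, not (1 + (2 + 3)/2) / 2 = 1.75
    assert_eq!(a.finalize(), Some(2.0));
}

The same state-based design extends to stdev (track count, sum, and sum
of squares); exact percentiles genuinely need all the values, or a
mergeable sketch such as t-digest, which is the hard case you raised.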