Hi Yibo,

That is correct. The simplest example is the average of 3 elements
{x1, x2, x3} split into two chunks, {x1} and {x2, x3}. The average of the
per-chunk averages is not equal to the overall average:

avg({avg({x1}), avg({x2, x3})}) = (x1 + (x2 + x3)/2) / 2
                               != (x1 + x2 + x3) / 3
                                = avg({x1, x2, x3})

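To make the fix concrete: if each chunk instead produces the partial state
(sum, count), merging those states is exact, and the division happens only
once at the end. A minimal sketch in Rust (the names are illustrative, not
the actual DataFusion API):

    // Hypothetical average accumulator; illustrative only.
    struct AvgState {
        sum: f64,
        count: u64,
    }

    impl AvgState {
        fn new() -> Self {
            AvgState { sum: 0.0, count: 0 }
        }

        // `update` folds one chunk's values into the partial state.
        fn update(&mut self, chunk: &[f64]) {
            self.sum += chunk.iter().sum::<f64>();
            self.count += chunk.len() as u64;
        }

        // `merge` combines two partial states; exact for avg because
        // (sum, count) pairs merge associatively, unlike the averages.
        fn merge(&mut self, other: &AvgState) {
            self.sum += other.sum;
            self.count += other.count;
        }

        // `finalize` divides once, over the fully merged state.
        fn finalize(&self) -> Option<f64> {
            if self.count == 0 {
                None
            } else {
                Some(self.sum / self.count as f64)
            }
        }
    }

    fn main() {
        // The chunking from above: {x1} and {x2, x3}.
        let (x1, x2, x3) = (1.0, 2.0, 3.0);
        let mut a = AvgState::new();
        a.update(&[x1]);
        let mut b = AvgState::new();
        b.update(&[x2, x3]);
        a.merge(&b);
        assert_eq!(a.finalize(), Some((x1 + x2 + x3) / 3.0));
    }
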
We are solving this in DataFusion (Rust):

Issue: https://issues.apache.org/jira/browse/ARROW-9937
Proposal:
https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit
PR: https://github.com/apache/arrow/pull/8172

Essentially, as you said, there need to be two operations, `update` and
`merge`, which are not the same.
There are many other examples, and that is also why, e.g., Spark's UDAF
interface requires both `update` and `merge`.
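
Incidentally, even stdev (from your question below) has an exactly
mergeable state: the parallel variance formulation of Chan et al. carries
(count, mean, m2) per chunk. Percentile is the genuinely hard case. A
sketch of the idea, again illustrative rather than any actual API:

    // Partial state for variance/stdev; the merge is exact (Chan et al.).
    struct VarState {
        count: f64,
        mean: f64,
        m2: f64, // sum of squared deviations from the current mean
    }

    impl VarState {
        fn new() -> Self {
            VarState { count: 0.0, mean: 0.0, m2: 0.0 }
        }

        // Welford's online update for a single value.
        fn update(&mut self, x: f64) {
            self.count += 1.0;
            let delta = x - self.mean;
            self.mean += delta / self.count;
            self.m2 += delta * (x - self.mean);
        }

        // Exact merge of two partial states.
        fn merge(&mut self, other: &VarState) {
            if other.count == 0.0 {
                return;
            }
            let n = self.count + other.count;
            let delta = other.mean - self.mean;
            // Uses the old self.count, so the order of these lines matters.
            self.m2 += other.m2 + delta * delta * self.count * other.count / n;
            self.mean += delta * other.count / n;
            self.count = n;
        }

        fn stdev_population(&self) -> Option<f64> {
            if self.count == 0.0 {
                None
            } else {
                Some((self.m2 / self.count).sqrt())
            }
        }
    }

    fn main() {
        let data = [1.0, 2.0, 3.0, 4.0];
        // One pass over everything...
        let mut whole = VarState::new();
        data.iter().for_each(|&x| whole.update(x));
        // ...versus two chunks merged.
        let (mut a, mut b) = (VarState::new(), VarState::new());
        data[..1].iter().for_each(|&x| a.update(x));
        data[1..].iter().for_each(|&x| b.update(x));
        a.merge(&b);
        assert!((whole.stdev_population().unwrap()
            - a.stdev_population().unwrap()).abs() < 1e-12);
    }

For percentile there is no small exact state; implementations either buffer
the chunks until "finalize" (as you suggest) or switch to an approximate
sketch such as t-digest.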

On Wed, Sep 16, 2020 at 12:12 PM Yibo Cai <yibo....@arm.com> wrote:

> Hi,
>
> I have a question about aggregate kernel implementation. Any help is
> appreciated.
>
> Aggregate kernel implements "consume" and "merge" interfaces. For a
> chunked array, "consume" is called for each array to get a temporary
> aggregated result, which is then "merged" with the previously consumed
> result. For associative operations like min/max/sum, this pattern is
> convenient. We can easily "merge" the min/max/sum of two arrays, e.g.,
> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
>
> But I wonder what's the best approach to deal with operations like
> stdev/percentile. Results of these operations cannot be easily "merged";
> we have to walk through all the chunks to get the result. For these
> operations, it looks like "consume" must copy the input array and do all
> the calculation once at "finalize" time. Or perhaps we shouldn't expect
> them to support chunked arrays.
>
> Yibo
>
