Appreciate the help from everyone. Looks like the Arrow C++ aggregate
kernels already address my problem of combining states from batches. Thanks.

On 9/18/20 12:08 PM, Micah Kornfield wrote:

Interestingly, Spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.

Yes, it looks like the actual Spark SQL code is at [1] though. Spark
doesn't seem to use the naive algorithm for std. dev. [2]


[1]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
[2]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L105

On Thu, Sep 17, 2020 at 8:56 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

Hi,

I think what everyone else was implicitly stating is that for combining
results across arrays, std. dev. and average need to keep more state than
the elements one is actually dealing with. For std. dev. you need to keep
two numbers (same with average).

This. Irrespective of which algorithm we use to compute an aggregate, the
core idea is that we split the calculation into batches, and we need to be
able to reduce each batch to a set of states (e.g. sum, count), so that we
can then reduce those states to a single state, from which the final
result is computed.

Also related: https://issues.apache.org/jira/browse/ARROW-9779
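
To make this concrete, here is a minimal sketch in plain C++ of the
consume/merge/finalize pattern, using mean as the example (the names
Consume/Merge/Finalize are made up for illustration; this is not the
actual Arrow kernel API):

#include <cstdint>
#include <iostream>
#include <vector>

// The per-batch state for mean: (sum, count). States merge associatively.
struct MeanState {
  double sum = 0.0;
  int64_t count = 0;
};

// "consume": reduce one batch to a state.
MeanState Consume(const std::vector<double>& batch) {
  MeanState s;
  for (double v : batch) {
    s.sum += v;
    ++s.count;
  }
  return s;
}

// "merge": combine two states into one.
MeanState Merge(const MeanState& a, const MeanState& b) {
  return {a.sum + b.sum, a.count + b.count};
}

// "finalize": compute the result from the merged state.
double Finalize(const MeanState& s) { return s.sum / s.count; }

int main() {
  MeanState s = Merge(Consume({1.0, 2.0}), Consume({3.0, 4.0}));
  std::cout << Finalize(s) << "\n";  // prints 2.5
}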

Interestingly, Spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.

Best,
Jorge


On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <
andrew.r.wiete...@gmail.com> wrote:

Dear all

I'm not sure I'm thinking about this right, but if we're looking to
leverage vectorization for standard deviation/variance, would it make
sense to compute the sum, the sum of squares, and the total number of
data points (N) over all chunks, and compute the actual function,

stdev = sqrt(sum_squares/N - (sum/N)^2)

only once at the end? This is one of the approaches in [1].
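
For illustration, a sketch of this approach in C++ (hypothetical names,
not Arrow's API): each chunk reduces to (sum, sum_sq, n), partial states
merge by simple addition, and stdev is computed once at the end. Note
Micah's caveat below about the numerical stability of this formula.

#include <cmath>
#include <cstdint>
#include <vector>

struct MomentState {
  double sum = 0.0;
  double sum_sq = 0.0;
  int64_t n = 0;
};

// Reduce one chunk to a state; the loop is trivially vectorizable.
MomentState ConsumeChunk(const std::vector<double>& chunk) {
  MomentState s;
  for (double v : chunk) {
    s.sum += v;
    s.sum_sq += v * v;
    ++s.n;
  }
  return s;
}

// Merging partial states is plain addition.
MomentState Merge(const MomentState& a, const MomentState& b) {
  return {a.sum + b.sum, a.sum_sq + b.sum_sq, a.n + b.n};
}

// Computed once at the end: sqrt(sum_squares/N - (sum/N)^2).
double PopulationStdev(const MomentState& s) {
  double mean = s.sum / s.n;
  return std::sqrt(s.sum_sq / s.n - mean * mean);
}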

Best wishes
Andrew

On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:


stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))


This is not numerically stable. Please do not use it. Please see [1] for
some algorithms that might be better.
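
For example, here is a sketch (plain C++, hypothetical names) of one of
the stabler alternatives from [1]: Welford's online update per batch,
combined with the Chan et al. parallel merge, keeping (n, mean, M2) as
the state. It avoids the catastrophic cancellation of sum_sq - sum^2/n.

#include <cmath>
#include <cstdint>
#include <vector>

struct VarState {
  int64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the mean
};

// "consume": Welford's online update over one batch.
VarState Consume(const std::vector<double>& batch) {
  VarState s;
  for (double v : batch) {
    ++s.n;
    double delta = v - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (v - s.mean);
  }
  return s;
}

// "merge": Chan et al. combination of two partial states.
VarState Merge(const VarState& a, const VarState& b) {
  if (a.n == 0) return b;
  if (b.n == 0) return a;
  VarState out;
  out.n = a.n + b.n;
  double delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.n / out.n;
  out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
  return out;
}

// "finalize": sample standard deviation.
double SampleStdev(const VarState& s) { return std::sqrt(s.m2 / (s.n - 1)); }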

The equation you provided works well in practice for calculating stdev
for one array. It doesn't address the issue of combining stdev from
multiple arrays.


I think what everyone else was implicitly stating is that for combining
results across arrays, std. dev. and average need to keep more state than
the elements one is actually dealing with. For std. dev. you need to keep
two numbers (same with average).

For percentiles, I think calculating exactly will require quite a large
state (for integers, a histogram approach could be used to compress this;
see the sketch after the references below). There are, however, some good
approximation algorithms that can be used if exact values are not
necessary (for example t-digest [2]). At some point Arrow should probably
have both.

[1]
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
[2] https://github.com/tdunning/t-digest
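
As promised above, a sketch of the histogram idea for exact percentiles
over integers (plain C++, hypothetical names; t-digest [2] would take the
histogram's place for continuous or unbounded values):

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

using Histogram = std::map<int64_t, int64_t>;  // value -> count

// "consume": count values per bucket.
void Consume(Histogram& h, const std::vector<int64_t>& batch) {
  for (int64_t v : batch) ++h[v];
}

// "merge": add the counts of two histograms.
void Merge(Histogram& into, const Histogram& from) {
  for (const auto& kv : from) into[kv.first] += kv.second;
}

// "finalize": nearest-rank percentile, i.e. the smallest value whose
// cumulative count reaches ceil(p * total).
int64_t Percentile(const Histogram& h, double p) {
  int64_t total = 0;
  for (const auto& kv : h) total += kv.second;
  int64_t rank = static_cast<int64_t>(std::ceil(p * total));
  int64_t seen = 0;
  for (const auto& kv : h) {
    seen += kv.second;
    if (seen >= rank) return kv.first;
  }
  return 0;  // empty histogram
}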

On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:

Thanks Andrew. The link gives a cool method to calculate variance
incrementally. I think the problem is that it's computationally too
expensive (it cannot leverage vectorization, and needs three divisions
for a single data point).
The equation you provided works well in practice for calculating stdev
for one array. It doesn't address the issue of combining stdev from
multiple arrays.

On 9/16/20 6:25 PM, Andrew Lamb wrote:
Perhaps you can rewrite the functions in terms of other kernels that can
be merged -- for example something like the following

stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))

(loosely translated from
https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation)

On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:

Hi,

I have a question about aggregate kernel implementation. Any help is
appreciated.

Aggregate kernels implement "consume" and "merge" interfaces. For a
chunked array, "consume" is called on each array to get a temporary
aggregated result, which is then "merged" with the previously consumed
result. For associative operations like min/max/sum, this pattern is
convenient: we can easily "merge" the min/max/sum of two arrays, e.g.,
sum([array_a, array_b]) = sum(array_a) + sum(array_b).

But I wonder what's the best approach for operations like
stdev/percentile. Results of these operations cannot be easily "merged";
we have to walk through all the chunks to get the result. For these
operations, it looks like "consume" must copy the input array and do all
the calculation at "finalize" time. Or perhaps we shouldn't expect to
support chunked arrays for them.

Yibo






