Appreciate the help from everyone. Looks like the Arrow C++ aggregate
kernels already address my problem of combining states from batches. Thanks.

On 9/18/20 12:08 PM, Micah Kornfield wrote:

Interestingly, Spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.

Yes, it looks like the actual Spark SQL code is at [1] though. Spark
doesn't seem to use the naive algorithm for std. dev. [2]


[1]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
[2]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L105

On Thu, Sep 17, 2020 at 8:56 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

Hi,

I think what everyone else was implicitly stating is that for combining
results across arrays, std. dev. and average need to keep more state than
the elements one is actually dealing with. For std. dev. you need to keep
two numbers (same with average).

This. Irrespective of which algorithm we use to compute an aggregate, the
core idea is that we split the calculation into batches, and we need to be
able to reduce each batch to a set of states (e.g. sum, count), so that we
can then reduce those states to a single state, from which the final
result is computed.

Also related: https://issues.apache.org/jira/browse/ARROW-9779
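
To make this concrete, here is a minimal sketch in plain C++ of the
consume/merge/finalize pattern, using mean as the example (the names
Consume/Merge/Finalize are made up for illustration; this is not the
actual Arrow kernel API):

#include <cstdint>
#include <iostream>
#include <vector>

// The per-batch state for mean: (sum, count). States merge associatively.
struct MeanState {
  double sum = 0.0;
  int64_t count = 0;
};

// "consume": reduce one batch to a state.
MeanState Consume(const std::vector<double>& batch) {
  MeanState s;
  for (double v : batch) {
    s.sum += v;
    ++s.count;
  }
  return s;
}

// "merge": combine two states into one.
MeanState Merge(const MeanState& a, const MeanState& b) {
  return {a.sum + b.sum, a.count + b.count};
}

// "finalize": compute the result from the merged state.
double Finalize(const MeanState& s) { return s.sum / s.count; }

int main() {
  MeanState s = Merge(Consume({1.0, 2.0}), Consume({3.0, 4.0}));
  std::cout << Finalize(s) << "\n";  // prints 2.5
}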

Interestingly, Spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.

Best,
Jorge


On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <
andrew.r.wiete...@gmail.com> wrote:

Dear all

I'm not sure I'm thinking about this right, but if we're looking to
leverage vectorization for standard deviation/variance, would it make
sense to compute the sum, the sum of squares, and the total number of
data points (N) over all chunks, and compute the actual function,

stdev = sqrt(sum_squares/N - (sum/N)^2)

only once at the end? This is one of the approaches in [1].
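
For illustration, a sketch of this approach in C++ (hypothetical names,
not Arrow's API): each chunk reduces to (sum, sum_sq, n), partial states
merge by simple addition, and stdev is computed once at the end. Note
Micah's caveat below about the numerical stability of this formula.

#include <cmath>
#include <cstdint>
#include <vector>

struct MomentState {
  double sum = 0.0;
  double sum_sq = 0.0;
  int64_t n = 0;
};

// Reduce one chunk to a state; the loop is trivially vectorizable.
MomentState ConsumeChunk(const std::vector<double>& chunk) {
  MomentState s;
  for (double v : chunk) {
    s.sum += v;
    s.sum_sq += v * v;
    ++s.n;
  }
  return s;
}

// Merging partial states is plain addition.
MomentState Merge(const MomentState& a, const MomentState& b) {
  return {a.sum + b.sum, a.sum_sq + b.sum_sq, a.n + b.n};
}

// Computed once at the end: sqrt(sum_squares/N - (sum/N)^2).
double PopulationStdev(const MomentState& s) {
  double mean = s.sum / s.n;
  return std::sqrt(s.sum_sq / s.n - mean * mean);
}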

Best wishes
Andrew

On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:


stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))


This is not numerically stable. Please do not use it. Please see [1] for
some algorithms that might be better.
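
For example, here is a sketch (plain C++, hypothetical names) of one of
the stabler alternatives from [1]: Welford's online update per batch,
combined with the Chan et al. parallel merge, keeping (n, mean, M2) as
the state. It avoids the catastrophic cancellation of sum_sq - sum^2/n.

#include <cmath>
#include <cstdint>
#include <vector>

struct VarState {
  int64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the mean
};

// "consume": Welford's online update over one batch.
VarState Consume(const std::vector<double>& batch) {
  VarState s;
  for (double v : batch) {
    ++s.n;
    double delta = v - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (v - s.mean);
  }
  return s;
}

// "merge": Chan et al. combination of two partial states.
VarState Merge(const VarState& a, const VarState& b) {
  if (a.n == 0) return b;
  if (b.n == 0) return a;
  VarState out;
  out.n = a.n + b.n;
  double delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.n / out.n;
  out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
  return out;
}

// "finalize": sample standard deviation.
double SampleStdev(const VarState& s) { return std::sqrt(s.m2 / (s.n - 1)); }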

The equation you provided works well in practice for calculating stdev
for one array. It doesn't address the issue of combining stdev from
multiple arrays.


I think what everyone else was implicitly stating is that for combining
results across arrays, std. dev. and average need to keep more state than
the elements one is actually dealing with. For std. dev. you need to keep
two numbers (same with average).

For percentiles, I think calculating exactly will require quite a large
state (for integers, a histogram approach could be used to compress this;
see the sketch after the references below). There are, however, some good
approximation algorithms that can be used if exact values are not
necessary (for example t-digest [2]). At some point Arrow should probably
have both.

[1]
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
[2] https://github.com/tdunning/t-digest
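
As promised above, a sketch of the histogram idea for exact percentiles
over integers (plain C++, hypothetical names; t-digest [2] would take the
histogram's place for continuous or unbounded values):

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

using Histogram = std::map<int64_t, int64_t>;  // value -> count

// "consume": count values per bucket.
void Consume(Histogram& h, const std::vector<int64_t>& batch) {
  for (int64_t v : batch) ++h[v];
}

// "merge": add the counts of two histograms.
void Merge(Histogram& into, const Histogram& from) {
  for (const auto& kv : from) into[kv.first] += kv.second;
}

// "finalize": nearest-rank percentile, i.e. the smallest value whose
// cumulative count reaches ceil(p * total).
int64_t Percentile(const Histogram& h, double p) {
  int64_t total = 0;
  for (const auto& kv : h) total += kv.second;
  int64_t rank = static_cast<int64_t>(std::ceil(p * total));
  int64_t seen = 0;
  for (const auto& kv : h) {
    seen += kv.second;
    if (seen >= rank) return kv.first;
  }
  return 0;  // empty histogram
}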

On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:

Thanks Andrew. The link gives a cool method to calculate variance
incrementally. I think the problem is that it's computationally too
expensive (it cannot leverage vectorization, and needs three divisions
for a single data point).
The equation you provided works well in practice for calculating stdev
for one array. It doesn't address the issue of combining stdev from
multiple arrays.

On 9/16/20 6:25 PM, Andrew Lamb wrote:
Perhaps you can rewrite the functions in terms of other kernels that can
be merged -- for example something like the following

stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))

(loosely translated from
https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation)

On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:

Hi,

I have a question about aggregate kernel implementation. Any help is
appreciated.

Aggregate kernels implement "consume" and "merge" interfaces. For a
chunked array, "consume" is called on each array to get a temporary
aggregated result, which is then "merged" with the previously consumed
result. For associative operations like min/max/sum, this pattern is
convenient: we can easily "merge" the min/max/sum of two arrays, e.g.,
sum([array_a, array_b]) = sum(array_a) + sum(array_b).

But I wonder what's the best approach for operations like
stdev/percentile. Results of these operations cannot be easily "merged";
we have to walk through all the chunks to get the result. For these
operations, it looks like "consume" must copy the input array and do all
the calculation at "finalize" time. Or perhaps we shouldn't expect to
support chunked arrays for them.

Yibo






