>
> Interestingly, Spark uses count / N
> <https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
> to compute the average, not an online algorithm.

Yes, though it looks like the actual Spark SQL code is at [1]. Spark
doesn't seem to use the naive algorithm for std. dev. [2]


[1]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
[2]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L105

On Thu, Sep 17, 2020 at 8:56 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> > I think what everyone else was implicitly stating is that, to combine
> > results across arrays, std. dev. and average need extra state that is
> > different from the elements themselves. For std. dev. you need to keep a
> > few running values (count, mean, and a sum of squared deviations); for
> > average, just count and sum.
>
> Exactly. Irrespective of which algorithm we use to compute an aggregate, the
> core idea is that we split the calculation into batches, and we need to be
> able to reduce each batch to a small state (e.g. count and sum), so that we
> can then reduce those states to a single state from which the final result
> is computed.
>
> Also related: https://issues.apache.org/jira/browse/ARROW-9779
>
> Interestingly, Spark uses count / N
> <https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
> to compute the average, not an online algorithm.
>
> Best,
> Jorge
>
>
> On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <
> andrew.r.wiete...@gmail.com> wrote:
>
>> Dear all
>>
>> I'm not sure I'm thinking about this right, but if we're looking to
>> leverage vectorization for standard deviation/variance, would it make sense
>> to compute the sum, the sum of squares, and the total number of data points
>> (N) over all chunks, and compute the actual function,
>>
>> stdev = sqrt(sum_squares/N - (sum/N)^2)
>>
>> only once at the end? This is one of the approaches in [1].
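>> A rough sketch of what that could look like (illustrative names only, and
>> note the numerical-stability caveat raised further down this thread):
>>
>> #include <cmath>
>> #include <cstdint>
>> #include <vector>
>>
>> // Per-chunk state: easy to vectorize and trivially mergeable.
>> struct SumsState {
>>   double sum = 0.0;
>>   double sum_sq = 0.0;
>>   int64_t n = 0;
>> };
>>
>> SumsState Consume(const std::vector<double>& chunk) {
>>   SumsState s;
>>   for (double x : chunk) { s.sum += x; s.sum_sq += x * x; ++s.n; }
>>   return s;
>> }
>>
>> SumsState Merge(SumsState a, const SumsState& b) {
>>   a.sum += b.sum; a.sum_sq += b.sum_sq; a.n += b.n;
>>   return a;
>> }
>>
>> // Population std. dev. computed once at the end; this textbook form can
>> // lose precision when the mean is large relative to the spread.
>> double Stdev(const SumsState& s) {
>>   const double mean = s.sum / s.n;
>>   return std::sqrt(s.sum_sq / s.n - mean * mean);
>> }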
>>
>> Best wishes
>> Andrew
>>
>> On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>> > >
>> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>> >
>> >
>> > This is not numerically stable. Please do not use it. Please see [1] for
>> > some algorithms that might be better.
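>> > For instance, a small sketch of the "shifted data" variant from [1]
>> > (illustrative only): shifting by a constant K close to the data keeps the
>> > accumulated sums small, so the subtraction at the end is much better
>> > conditioned, while the loop stays simple to vectorize:
>> >
>> > #include <vector>
>> >
>> > // Shifted-data sample variance: K can be any value close to the data,
>> > // e.g. the first element; the result is mathematically unchanged.
>> > double ShiftedVariance(const std::vector<double>& x) {
>> >   if (x.size() < 2) return 0.0;
>> >   const double k = x[0];
>> >   double sum = 0.0, sum_sq = 0.0;
>> >   for (double v : x) {
>> >     const double d = v - k;
>> >     sum += d;
>> >     sum_sq += d * d;
>> >   }
>> >   const double n = static_cast<double>(x.size());
>> >   return (sum_sq - sum * sum / n) / (n - 1.0);
>> > }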
>> >
>> > > The equation you provided is great in practice to calculate stdev for one
>> > > array. It doesn't address the issue of combining stdev from multiple
>> > > arrays.
>> >
>> >
>> > I think what everyone else was implicitly stating is that, to combine
>> > results across arrays, std. dev. and average need extra state that is
>> > different from the elements themselves. For std. dev. you need to keep a
>> > few running values (count, mean, and a sum of squared deviations); for
>> > average, just count and sum.
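>> > Concretely, a sketch of state that both merges and stays numerically
>> > stable (the parallel/pairwise-merge formula described in the same article
>> > as [1]; names here are illustrative): each chunk is reduced to
>> > (count, mean, M2), and two such states combine in O(1):
>> >
>> > #include <cmath>
>> > #include <cstdint>
>> >
>> > // Per-chunk state: count, mean, and M2 (sum of squared deviations).
>> > struct VarState {
>> >   int64_t n = 0;
>> >   double mean = 0.0;
>> >   double m2 = 0.0;
>> > };
>> >
>> > // Merge two states (Chan et al. parallel update).
>> > VarState Merge(const VarState& a, const VarState& b) {
>> >   VarState out;
>> >   out.n = a.n + b.n;
>> >   if (out.n == 0) return out;
>> >   const double delta = b.mean - a.mean;
>> >   out.mean = a.mean + delta * b.n / out.n;
>> >   out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
>> >   return out;
>> > }
>> >
>> > double SampleStdev(const VarState& s) {
>> >   return std::sqrt(s.m2 / (s.n - 1));
>> > }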
>> >
>> > For percentiles, I think calculating exactly will require quite a large
>> > state (for integers a histogram approach could be used to compress this).
>> > There are however some good approximation algorithms that can be used if
>> > exact values are not necessary (for example t-digest [2]). At some point
>> > Arrow should probably have both.
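>> > As an illustration of the histogram idea for integers (a sketch only, not
>> > a proposed API): per-chunk counts merge by simple addition, and the exact
>> > percentile is read off the merged counts at finalize time:
>> >
>> > #include <cmath>
>> > #include <cstdint>
>> > #include <map>
>> > #include <vector>
>> >
>> > // Per-chunk state: value -> count. Merging is just adding counts.
>> > using Histogram = std::map<int64_t, int64_t>;
>> >
>> > Histogram Consume(const std::vector<int64_t>& chunk) {
>> >   Histogram h;
>> >   for (int64_t v : chunk) ++h[v];
>> >   return h;
>> > }
>> >
>> > Histogram Merge(Histogram a, const Histogram& b) {
>> >   for (const auto& kv : b) a[kv.first] += kv.second;
>> >   return a;
>> > }
>> >
>> > // Exact p-th percentile (0 < p <= 100), nearest-rank definition.
>> > int64_t Percentile(const Histogram& h, double p) {
>> >   int64_t total = 0;
>> >   for (const auto& kv : h) total += kv.second;
>> >   int64_t rank = static_cast<int64_t>(std::ceil(p / 100.0 * total));
>> >   if (rank < 1) rank = 1;
>> >   int64_t seen = 0;
>> >   for (const auto& kv : h) {
>> >     seen += kv.second;
>> >     if (seen >= rank) return kv.first;
>> >   }
>> >   return 0;  // only reached for empty input
>> > }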
>> >
>> > [1]
>> > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
>> > [2] https://github.com/tdunning/t-digest
>> >
>> > On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>> >
>> > > Thanks Andrew. The link gives a cool method to calculate variance
>> > > incrementally. I think the problem is that it's computationally too
>> > > expensive (it cannot leverage vectorization, and needs three divisions
>> > > per data point).
>> > > The equation you provided is great in practice to calculate stdev for one
>> > > array. It doesn't address the issue of combining stdev from multiple
>> > > arrays.
>> > >
>> > > On 9/16/20 6:25 PM, Andrew Lamb wrote:
>> > > > Perhaps you can rewrite the functions in terms of other kernels that
>> > > > can be merged -- for example something like the following
>> > > >
>> > > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>> > > >
>> > > > (loosely translated from
>> > > > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
>> > > > )
>> > > >
>> > > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> I have a question about aggregate kernel implementation. Any help is
>> > > >> appreciated.
>> > > >>
>> > > >> Aggregate kernels implement "consume" and "merge" interfaces. For a
>> > > >> chunked array, "consume" is called for each array to get a temporary
>> > > >> aggregated result, which is then "merged" with the previously consumed
>> > > >> result. For associative operations like min/max/sum, this pattern is
>> > > >> convenient: we can easily "merge" the min/max/sum of two arrays, e.g.,
>> > > >> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
>> > > >>
>> > > >> But I wonder what the best approach is for operations like
>> > > >> stdev/percentile. Results of these operations cannot be easily
>> > > >> "merged"; we have to walk through all the chunks to get the result.
>> > > >> For these operations, it looks like "consume" must copy the input
>> > > >> array and do all the calculation at "finalize" time. Or perhaps we
>> > > >> shouldn't expect them to support chunked arrays at all.
>> > > >>
>> > > >> Yibo
>> > > >>
>> > > >
>> > >
>> >
>>
>
