> Interestingly, Spark uses count / N
> <https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
> to compute the average, not an online algorithm.
Yes, it looks like the actual Spark SQL code is at [1], though. Spark doesn't seem to use the naive algorithm for Std. Dev. [2]

[1] https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
[2] https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L105

On Thu, Sep 17, 2020 at 8:56 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Hi,
>
> > I think what everyone else was potentially stating implicitly is that for combining details about arrays, for std. dev. and average there needs to be more state kept that is different from the elements that one is actually dealing with. For std. dev. you need to keep two numbers (same with average).
>
> This; irrespective of which algorithm we compute an aggregate with, the core idea is that we split the calculation into batches, and we need to be able to reduce each batch to a set of states (e.g. N, count), so that we can reduce these states to a single state, which can be used to compute the final result.
>
> Also related: https://issues.apache.org/jira/browse/ARROW-9779
>
> Interestingly, Spark uses count / N
> <https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
> to compute the average, not an online algorithm.
>
> Best,
> Jorge
>
> On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <andrew.r.wiete...@gmail.com> wrote:
>
>> Dear all,
>>
>> I'm not sure I'm thinking about this right, but if we're looking to leverage vectorization for standard deviation/variance, would it make sense to compute the sum, the sum of squares, and the total number of data points (N) over all chunks and compute the actual function,
>>
>> stdev = sqrt(sum_squares/N - (sum/N)^2)
>>
>> only once at the end? This is one of the approaches in [1].
>>
>> Best wishes,
>> Andrew
>>
>> On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>> >
>> > This is not numerically stable. Please do not use it. Please see [1] for some algorithms that might be better.
>> >
>> > > The equation you provided is great in practice to calculate stdev for one array. It doesn't address the issue of combining stdev from multiple arrays.
>> >
>> > I think what everyone else was potentially stating implicitly is that for combining details about arrays, for std. dev. and average there needs to be more state kept that is different from the elements that one is actually dealing with. For std. dev. you need to keep two numbers (same with average).
>> >
>> > For percentiles, I think calculating exactly will require quite a large state (for integers a histogram approach could be used to compress this). There are however some good approximation algorithms that can be used if exact values are not necessary (for example t-digest [2]). At some point Arrow should probably have both.
>> >
>> > [1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
>> > [2] https://github.com/tdunning/t-digest
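As a concrete illustration of the state/merge idea discussed above, here is a minimal standalone C++ sketch (illustrative names only, not the actual Arrow kernel API): each chunk is consumed into an (n, mean, M2) state with Welford's update, and two states are merged with the pairwise combination formula from the Wikipedia article cited above, so a chunked standard deviation never has to revisit earlier chunks.

#include <cmath>
#include <cstdint>
#include <vector>

// Per-chunk state: three numbers are enough to merge variances exactly.
struct VarState {
  int64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the running mean
};

// "consume": reduce one chunk to a state (Welford's online update).
VarState Consume(const std::vector<double>& chunk) {
  VarState s;
  for (double x : chunk) {
    ++s.n;
    double delta = x - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (x - s.mean);
  }
  return s;
}

// "merge": combine two states (parallel/pairwise variance formula).
VarState Merge(const VarState& a, const VarState& b) {
  if (a.n == 0) return b;
  if (b.n == 0) return a;
  VarState out;
  out.n = a.n + b.n;
  double delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.n / out.n;
  out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
  return out;
}

// "finalize": sample standard deviation from the merged state.
double Finalize(const VarState& s) {
  return s.n > 1 ? std::sqrt(s.m2 / (s.n - 1)) : 0.0;
}

Merging is associative, so states from any number of chunks can be combined in any order before Finalize is called.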
>> > On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
>> >
>> > > Thanks Andrew. The link gives a cool method to calculate variance incrementally. I think the problem is that it's computationally too expensive (it cannot leverage vectorization, and needs three divisions for a single data point).
>> > > The equation you provided is great in practice to calculate stdev for one array. It doesn't address the issue of combining stdev from multiple arrays.
>> > >
>> > > On 9/16/20 6:25 PM, Andrew Lamb wrote:
>> > > > Perhaps you can rewrite the functions in terms of other kernels that can be merged -- for example something like the following
>> > > >
>> > > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
>> > > >
>> > > > (loosely translated from https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation)
>> > > >
>> > > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> I have a question about aggregate kernel implementation. Any help is appreciated.
>> > > >>
>> > > >> Aggregate kernels implement "consume" and "merge" interfaces. For a chunked array, "consume" is called for each array to get a temporary aggregated result, then "merge" it with the previously consumed result. For associative operations like min/max/sum, this pattern is convenient. We can easily "merge" the min/max/sum of two arrays, e.g., sum([array_a, array_b]) = sum(array_a) + sum(array_b).
>> > > >>
>> > > >> But I wonder what's the best approach to deal with operations like stdev/percentile. Results of these operations cannot be easily "merged". We have to walk through all the chunks to get the result. For these operations, it looks like "consume" must copy the input array and do all the calculation once at "finalize" time. Or we don't expect them to support chunked arrays.
>> > > >>
>> > > >> Yibo
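For comparison, the sum / sum-of-squares approach suggested by Andrew Lamb and Andrew Wieteska also fits the consume/merge/finalize pattern from the original question, because the partial sums merge by plain addition. Below is a minimal standalone C++ sketch (again with illustrative names, not Arrow's actual interface), keeping Micah's caveat in mind that this form can lose precision when the variance is tiny relative to the mean.

#include <cmath>
#include <cstdint>
#include <vector>

// Per-chunk state: plain sums, cheap to compute and trivially mergeable.
struct SumState {
  int64_t n = 0;
  double sum = 0.0;
  double sum_squares = 0.0;
};

// "consume": a straight reduction over the chunk (easy to vectorize).
SumState Consume(const std::vector<double>& chunk) {
  SumState s;
  for (double x : chunk) {
    ++s.n;
    s.sum += x;
    s.sum_squares += x * x;
  }
  return s;
}

// "merge": partial sums are associative, so merging is just addition.
SumState Merge(const SumState& a, const SumState& b) {
  return {a.n + b.n, a.sum + b.sum, a.sum_squares + b.sum_squares};
}

// "finalize": stdev = sqrt((sum_squares - sum*sum/N) / (N - 1)).
// Beware catastrophic cancellation when the mean dwarfs the variance.
double Finalize(const SumState& s) {
  if (s.n < 2) return 0.0;
  return std::sqrt((s.sum_squares - s.sum * s.sum / s.n) / (s.n - 1));
}

The per-chunk loop is a contiguous reduction, which is what makes this variant easy to vectorize; the trade-off is the numerical-stability issue noted above.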