[jira] [Commented] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations

Todd Lisonbee (JIRA) Sat, 19 Mar 2016 06:18:19 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200562#comment-15200562
 ]


Todd Lisonbee commented on FLINK-3613:
--------------------------------------

I didn't find an exact overlap: FLINK-2144 is similar but targets windows, 
and FLINK-2379 is for vectors but doesn't use the interface above.

---

Implementing this isn't as easy as extending the existing AggregationFunction 
abstract class.  AggregationFunction works for Sum, Min, and Max but isn't 
general enough for other aggregations.

An aggregation should have three types:
1) the value type - the type being aggregated
2) the aggregate type - the intermediate type that carries all needed data for 
the aggregation
3) the result type - the result of the aggregation
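
To make the three types concrete, here is a rough Java sketch. The interface 
name Aggregator, its method names, and the Aggregators helper are all 
hypothetical and are not part of the existing Flink API; this only shows 
how value type V, aggregate type A, and result type R could be kept separate.

```java
import java.util.List;

// Hypothetical interface, not existing Flink API.
// V = value type, A = aggregate (intermediate) type, R = result type.
interface Aggregator<V, A, R> {
    A createAggregate();          // empty intermediate state
    A add(A aggregate, V value);  // fold one value into the state
    A combine(A left, A right);   // merge two partial states
    R result(A aggregate);        // turn the state into the final result
}

// COUNT over doubles: value type Double, aggregate type Long, result type Long.
final class CountAggregator implements Aggregator<Double, Long, Long> {
    public Long createAggregate() { return 0L; }
    public Long add(Long aggregate, Double value) { return aggregate + 1; }
    public Long combine(Long left, Long right) { return left + right; }
    public Long result(Long aggregate) { return aggregate; }
}

// SUM over doubles: all three types are Double.
final class SumAggregator implements Aggregator<Double, Double, Double> {
    public Double createAggregate() { return 0.0; }
    public Double add(Double aggregate, Double value) { return aggregate + value; }
    public Double combine(Double left, Double right) { return left + right; }
    public Double result(Double aggregate) { return aggregate; }
}

// Hypothetical helper that runs an aggregator over an in-memory list.
final class Aggregators {
    static <V, A, R> R aggregate(Aggregator<V, A, R> agg, List<V> values) {
        A state = agg.createAggregate();
        for (V v : values) {
            state = agg.add(state, v);
        }
        return agg.result(state);
    }
}
```

With this shape, Aggregators.aggregate(new CountAggregator(), values) returns 
a Long even though the values are Doubles, which a single-type 
AggregationFunction cannot express.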

For example, if you are aggregating doubles in different ways:
SUM - value type is double, aggregate type is double, result type is double
COUNT - value type is double, aggregate type is probably long, result type is 
long
STANDARD_DEVIATION - value type is double, aggregate type would be a complex 
type (count, mean, sum of squared differences from the current mean, deltas), 
result type is double
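
As an illustration of the STANDARD_DEVIATION case, the complex aggregate 
type might look roughly like the sketch below. The class name 
StdDevAggregate and its methods are hypothetical. It assumes Welford's 
single-pass update for add, the pairwise merge formula from Chan et al. for 
combining partial aggregates, and the population (divide by n) standard 
deviation; a real implementation might choose the sample form instead.

```java
// Hypothetical aggregate type, not existing Flink API.
// Carries count, running mean, and m2 = sum of squared differences
// from the current mean.
final class StdDevAggregate {
    long count = 0L;
    double mean = 0.0;
    double m2 = 0.0;

    // Welford's single-pass update for one new value.
    void add(double value) {
        count++;
        double delta = value - mean;   // difference from the old mean
        mean += delta / count;
        m2 += delta * (value - mean);  // uses both the old and the new mean
    }

    // Merge another partial aggregate into this one (Chan et al. pairwise formula).
    void merge(StdDevAggregate other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    // Population variance (assumption: divide by n, not n - 1).
    double variance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    // Result type is double even though the aggregate type is this class.
    double standardDeviation() {
        return Math.sqrt(variance());
    }
}
```

The merge step is what lets partial aggregates computed on different 
partitions be combined without a second pass over the data.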

> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>
> Implement standard deviation, mean, variance for 
> org.apache.flink.api.java.aggregation.Aggregations
> Ideally implementation should be single pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et 
> al, International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces 
> the numerical errors that occur when adding a sequence of finite precision 
> floating point numbers. Numerical errors arise due to truncation and 
> rounding. These errors can lead to numerical instability when calculating 
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
