[ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212552#comment-15212552 ]

Fabian Hueske commented on FLINK-3613:
--------------------------------------

Hi Todd,

Thanks for the detailed proposal and the analysis of the shortcomings of the
current implementation.

Here are a few comments on the proposal:
- We recently designed a new aggregation interface for the Table API that
looks very much like yours; see FLINK-3473 and FLINK-3474.
- As you observed, the current API is limited because the return type is
identical to the input type. This means that only type-preserving aggregations
are possible and that each field can be aggregated only once.
- There is also a *very* old PR that aimed to add support for out-of-place 
aggregations: https://github.com/apache/flink/pull/243
- If we allow out-of-place aggregations and non-type-preserving aggregation
methods, the Java compiler won't be able to infer the type of the result data
set (see the sketch after this list). We have the same problem with the
{{DataSet.project()}} method.
- I would go for option 2, i.e., implement a new API and deprecate the current 
one.
- Because this touches/deprecates stable APIs and because of the result type 
inference issue, we should get community consensus before the actual 
implementation is started.
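
To make the result type inference issue concrete, here is a minimal sketch.
The {{AggregationBuilder}} interface and the {{count}}/{{avg}} factory methods
are hypothetical, for illustration only; they are not part of Flink:

{code:java}
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.aggregation.AggregationFunction;
import org.apache.flink.api.java.tuple.Tuple;

// Hypothetical interface, for illustration only -- not an existing Flink API.
public interface AggregationBuilder<IN extends Tuple> {

    // The arity and field types of OUT depend on which aggregation functions
    // are passed at runtime (a count yields a Long, an average a Double, ...),
    // so the Java compiler cannot derive OUT from the arguments alone. As with
    // DataSet.project(), the caller has to state the result type explicitly,
    // e.g. via a type witness:
    //
    //   DataSet<Tuple2<Long, Double>> result =
    //       input.<Tuple2<Long, Double>>aggregate(count(1), avg(2));
    //
    // where count(int) and avg(int) are hypothetical factory methods.
    <OUT extends Tuple> DataSet<OUT> aggregate(AggregationFunction<?>... functions);
}
{code}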

> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, and variance for 
> org.apache.flink.api.java.aggregation.Aggregations
> Ideally, the implementation should be single-pass and numerically stable
> (see the sketch after this description).
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et 
> al., International Conference on Data Engineering (ICDE) 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces 
> the numerical errors that occur when adding a sequence of finite precision 
> floating point numbers. Numerical errors arise due to truncation and 
> rounding. These errors can lead to numerical instability when calculating 
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
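
For reference, a minimal sketch of what a single-pass, numerically stable
implementation of these aggregates could look like: Welford's online algorithm
with the Chan et al. merge step that a distributed combiner would need. The
class and method names are illustrative, not an existing Flink API; Kahan
compensated summation could additionally be applied to the running sums.

{code:java}
// Single-pass, numerically stable mean/variance/stddev (Welford's algorithm).
public class WelfordAccumulator {

    private long count = 0L;
    private double mean = 0.0;
    private double m2 = 0.0;  // sum of squared deviations from the running mean

    /** Folds a single value into the running statistics. */
    public void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;         // update the running mean
        m2 += delta * (value - mean);  // uses both the old and the new mean
    }

    /** Merges a partial accumulator, as a combiner / partial aggregate would. */
    public void merge(WelfordAccumulator other) {
        if (other.count == 0) {
            return;
        }
        double delta = other.mean - mean;
        long total = count + other.count;
        mean += delta * other.count / total;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        count = total;
    }

    public double mean() {
        return mean;
    }

    /** Sample variance (n - 1 denominator); NaN for fewer than two values. */
    public double variance() {
        return count < 2 ? Double.NaN : m2 / (count - 1);
    }

    public double stddev() {
        return Math.sqrt(variance());
    }
}
{code}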



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
