[
https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212552#comment-15212552
]
Fabian Hueske commented on FLINK-3613:
--------------------------------------
Hi Todd,
thanks for the detailed proposal and analysis of the shortcomings of the
current implementation.
Here are a few comments on the proposal:
- We recently designed a new aggregation interface for the Table API that
looks very much like yours; see FLINK-3473 and FLINK-3474.
- As you observed, the current API is limited because the return type is
identical to the input type. This means that only type-preserving aggregations
are possible and each field can only be aggregated once.
- There is also a *very* old PR that aimed to add support for out-of-place
aggregations: https://github.com/apache/flink/pull/243
- If we allow out-of-place aggregations and non-type-preserving aggregation
methods, the Java compiler won't be able to infer the type of the result data
set. We have the same problem with the {{DataSet.project()}} method (a minimal
sketch of the inference problem follows below this list).
- I would go for option 2, i.e., implement a new API and deprecate the current
one.
- Because this touches/deprecates stable APIs and because of the result type
inference issue, we should get community consensus before the actual
implementation is started.
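To make the result type inference issue concrete, here is a minimal, Flink-free
Java sketch. The {{aggregate()}} method is hypothetical; it only mirrors the
shape of {{DataSet.project()}}, i.e., a generic return type that does not
appear in the parameter list:
{code:java}
import java.util.Collections;
import java.util.List;

public class ResultTypeInference {

    // Hypothetical method: the type parameter OUT does not occur in the
    // argument list, just like the output type of DataSet.project() or of
    // an out-of-place aggregate(). The compiler can only bind OUT from the
    // assignment target, and type erasure removes it at runtime, so the
    // framework would still need an explicit type hint to create
    // serializers for the result.
    static <OUT> List<OUT> aggregate(List<?> input, int... fieldIndexes) {
        return Collections.emptyList(); // placeholder body, sketch only
    }

    public static void main(String[] args) {
        List<Object[]> rows = Collections.emptyList();
        // OUT is inferred as Double purely from the assignment target;
        // the call arguments say nothing about it:
        List<Double> means = aggregate(rows, 1);
        System.out.println(means);
    }
}
{code}
In the DataSet API this is usually worked around with explicit type hints,
which is part of what the community should agree on up front.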
> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
> Key: FLINK-3613
> URL: https://issues.apache.org/jira/browse/FLINK-3613
> Project: Flink
> Issue Type: Improvement
> Reporter: Todd Lisonbee
> Priority: Minor
> Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, variance for
> org.apache.flink.api.java.aggregation.Aggregations
> Ideally, the implementation should be single-pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et
> al, International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces
> the numerical errors that occur when adding a sequence of finite precision
> floating point numbers. Numerical errors arise due to truncation and
> rounding. These errors can lead to numerical instability when calculating
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
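To make "single-pass and numerically stable" concrete, here is a minimal
sketch of Welford's algorithm in plain Java (illustrative names, not Flink
code). The {{merge()}} method uses the standard pairwise combination formula
(Chan et al.), which is the property a combinable Flink aggregation would need
for partial aggregates:
{code:java}
// Minimal sketch of Welford's single-pass mean/variance, plus a merge
// step for combining partial aggregates (e.g., in a combiner/reducer).
public class WelfordVariance {

    private long n;        // number of values seen so far
    private double mean;   // running mean
    private double m2;     // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Combine two partial aggregates (Chan et al. pairwise formula).
    public void merge(WelfordVariance other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n; mean = other.mean; m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        mean += delta * other.n / total;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        n = total;
    }

    public double mean()     { return mean; }
    public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; } // sample variance
    public double stddev()   { return Math.sqrt(variance()); }

    public static void main(String[] args) {
        WelfordVariance w = new WelfordVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            w.add(x);
        }
        System.out.printf("mean=%.3f variance=%.3f stddev=%.3f%n",
                w.mean(), w.variance(), w.stddev());
    }
}
{code}
For this example input the sketch prints mean=5.000, variance=4.571 (sample
variance), and stddev=2.138.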
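Similarly, a small sketch of the Kahan (compensated) summation referenced
above, again plain Java with illustrative names; {{main()}} contrasts it with
naive summation to show the drift:
{code:java}
// Minimal sketch of Kahan (compensated) summation: a running
// compensation term captures the low-order bits that are lost when a
// small value is added to a much larger running sum.
public class KahanSum {

    private double sum; // running sum
    private double c;   // running compensation for lost low-order bits

    public void add(double x) {
        double y = x - c;   // apply the previous compensation
        double t = sum + y; // low-order bits of y may be lost here
        c = (t - sum) - y;  // recover what was lost ...
        sum = t;            // ... and carry it into the next add
    }

    public double value() {
        return sum;
    }

    public static void main(String[] args) {
        KahanSum kahan = new KahanSum();
        double naive = 0.0;
        for (int i = 0; i < 10_000_000; i++) {
            kahan.add(0.1);
            naive += 0.1;
        }
        // The naive sum drifts noticeably further from 1,000,000:
        System.out.println("kahan = " + kahan.value());
        System.out.println("naive = " + naive);
    }
}
{code}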
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)