[ 
https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224259#comment-15224259
 ] 

Stephan Ewen edited comment on FLINK-3613 at 4/4/16 2:42 PM:
-------------------------------------------------------------

The design of the extended aggregators makes a lot of sense. I agree with 
Fabian that we should discuss three things first, however:

  1. Do we want such extended aggregations in the DataSet API, or should we 
push people to use the Table API instead? My gut feeling is that it makes sense 
to have this in the DataSet API if we answer (2) with "yes" and have a good 
design for (3).
  2. I assume it should support multiple aggregation functions at once, so that 
one could create something like {{(a, b) --> (max(a), min(a), avg(b))}}
  3. How do we want the signatures for this to look? Ideally they would be 
typesafe via a builder (similar to the CSV input on ExecutionEnvironment).
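
To make points (2) and (3) concrete, here is a rough sketch of what a builder 
for several aggregations in one pass could look like. It is plain Java and does 
not use any Flink classes; MultiAggregation, Agg, and the max/min/avg/apply 
methods are hypothetical names invented for this illustration, not an existing 
or proposed Flink API. The field accessors are statically typed, but the result 
is a plain double[]; a statically typed result tuple would probably need one 
builder class per arity, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch only: none of these names exist in Flink.
final class MultiAggregation<T> {

    /** One running aggregate over a double-valued field of a record. */
    interface Agg<R> {
        void add(R record);
        double result();
    }

    // Factories rather than instances, so the builder can be applied repeatedly.
    private final List<Supplier<Agg<T>>> factories = new ArrayList<>();

    MultiAggregation<T> max(ToDoubleFunction<T> field) {
        factories.add(() -> new Agg<T>() {
            private double best = Double.NEGATIVE_INFINITY;
            @Override public void add(T r) { best = Math.max(best, field.applyAsDouble(r)); }
            @Override public double result() { return best; }
        });
        return this;
    }

    MultiAggregation<T> min(ToDoubleFunction<T> field) {
        factories.add(() -> new Agg<T>() {
            private double best = Double.POSITIVE_INFINITY;
            @Override public void add(T r) { best = Math.min(best, field.applyAsDouble(r)); }
            @Override public double result() { return best; }
        });
        return this;
    }

    MultiAggregation<T> avg(ToDoubleFunction<T> field) {
        factories.add(() -> new Agg<T>() {
            private double sum;
            private long count;
            @Override public void add(T r) { sum += field.applyAsDouble(r); count++; }
            @Override public double result() { return sum / count; } // NaN for empty input
        });
        return this;
    }

    /** Runs all registered aggregations in a single pass over the records. */
    double[] apply(Iterable<T> records) {
        List<Agg<T>> running = new ArrayList<>();
        for (Supplier<Agg<T>> f : factories) {
            running.add(f.get());
        }
        for (T r : records) {
            for (Agg<T> a : running) {
                a.add(r);
            }
        }
        double[] out = new double[running.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = running.get(i).result();
        }
        return out;
    }
}
```

With records modeled as double[] {a, b}, the example from (2) would read 
new MultiAggregation<double[]>().max(r -> r[0]).min(r -> r[0]).avg(r -> r[1]) 
followed by apply(records), with results in the order the aggregations were 
registered.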





> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, variance for 
> org.apache.flink.api.java.aggregation.Aggregations
> Ideally, the implementation should be single-pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et 
> al., International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces 
> the numerical errors that occur when adding a sequence of finite precision 
> floating point numbers. Numerical errors arise due to truncation and 
> rounding. These errors can lead to numerical instability when calculating 
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
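
For reference, one common way to get a single-pass, numerically stable mean and 
variance is Welford's online update, combined with the pairwise merge formula 
of Chan et al. so that partial results from parallel partitions can be 
combined. The sketch below is plain Java under that assumption; it is not 
necessarily the method of the SystemML paper cited above (which builds on 
Kahan-corrected updates), and RunningStats is a hypothetical name, not part of 
org.apache.flink.api.java.aggregation.

```java
// Hypothetical sketch: Welford's online algorithm with a mergeable state.
final class RunningStats {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // sum of squared deviations from the running mean

    /** Adds one value, updating mean and m2 without storing the data. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the already-updated mean
    }

    /** Folds another partial result into this one (Chan et al. formula). */
    void merge(RunningStats other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        // m2 must be updated before mean and n, since it needs the old values.
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    double mean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Sample variance (divides by n - 1); NaN for fewer than two values. */
    double variance() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    double stdDev() {
        return Math.sqrt(variance());
    }
}
```

The state is three numbers per aggregate, so it fits a combine-then-reduce 
execution: each partition calls add() on its values and the partial states are 
merged afterwards.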



