[ 
https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227168#comment-15227168
 ] 

ASF GitHub Bot commented on FLINK-3664:
---------------------------------------

GitHub user tlisonbee opened a pull request:

    https://github.com/apache/flink/pull/1855

    [FLINK-3664] Create method to easily summarize a DataSet of Tuples

    Adds a summarize() method to DataSetUtils that computes a number of
    single-pass statistics for DataSets of Tuples.
    
    Summary statistics depend on the type being summarized:
    
    - Numeric types (Integer, IntValue, Float, Double, etc.): min, max, mean,
      variance, standard deviation, NaN count, Infinity count, totalCount, etc.
    - String, StringValue: minLength, maxLength, meanLength, emptyCount,
      totalCount.
    - Boolean, BooleanValue: trueCount, falseCount, totalCount.
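
    As a rough illustration of what a single-pass numeric summary could look
    like, here is a minimal sketch. The class NumericSummarySketch and its
    method names are hypothetical and are not the classes added by this pull
    request. It uses Welford's online update for the mean and variance, and it
    assumes that NaN and infinite values are counted but excluded from the
    other statistics, and that the variance is the sample variance (n - 1).

```java
// Hypothetical sketch: a single-pass summary of a numeric column.
// Not the implementation from the pull request.
final class NumericSummarySketch {
    private long totalCount;     // every value seen, including NaN and Infinity
    private long nanCount;
    private long infinityCount;
    private long finiteCount;    // values that take part in min/max/mean/variance
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private double mean;         // running mean of the finite values
    private double m2;           // running sum of squared deviations from the mean

    void add(double value) {
        totalCount++;
        if (Double.isNaN(value)) {
            nanCount++;
            return;
        }
        if (Double.isInfinite(value)) {
            infinityCount++;
            return;
        }
        finiteCount++;
        min = Math.min(min, value);
        max = Math.max(max, value);
        // Welford's update: one pass, no need to store the values.
        double delta = value - mean;
        mean += delta / finiteCount;
        m2 += delta * (value - mean);
    }

    long getTotalCount() { return totalCount; }
    long getNanCount() { return nanCount; }
    long getInfinityCount() { return infinityCount; }
    double getMin() { return min; }
    double getMax() { return max; }
    double getMean() { return finiteCount > 0 ? mean : Double.NaN; }

    // Sample variance (divides by n - 1); an assumption of this sketch.
    double getVariance() {
        return finiteCount > 1 ? m2 / (finiteCount - 1) : Double.NaN;
    }

    double getStandardDeviation() { return Math.sqrt(getVariance()); }
}
```

    Because add() only touches a fixed number of fields, the same state could
    in principle be kept per column and updated once per record.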
    
    Example usage:
    `DataSet<Tuple3<Double, String, Boolean>> input = // [...]`
    `Tuple3<NumericColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = DataSetUtils.summarize(input)`
    
    `summary.f0.getStandardDeviation()`
    `summary.f1.getMaxLength()`
    
    Uses the Kahan summation algorithm to avoid numerical instability. The
    algorithm is described in: "Scalable and Numerically Stable Descriptive
    Statistics in SystemML", Tian et al., International Conference on Data
    Engineering 2012.
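
    For readers unfamiliar with Kahan (compensated) summation, a minimal
    sketch follows. The class name KahanSum is hypothetical and is not taken
    from the pull request. The idea is to carry a second variable that holds
    the low-order bits lost in each addition and feed them back into the next
    one.

```java
// Hypothetical sketch of Kahan (compensated) summation.
// Not the class added by the pull request.
final class KahanSum {
    private double sum;          // running total
    private double compensation; // rounding error accumulated so far

    void add(double value) {
        double corrected = value - compensation;      // re-apply the error lost earlier
        double newSum = sum + corrected;              // low-order bits of corrected may be lost here
        compensation = (newSum - sum) - corrected;    // recover what was just lost
        sum = newSum;
    }

    double value() {
        return sum;
    }

    public static void main(String[] args) {
        KahanSum kahan = new KahanSum();
        double naive = 0.0;
        for (int i = 0; i < 1_000_000; i++) {
            kahan.add(0.1);
            naive += 0.1;
        }
        // The naive total drifts away from 100000 further than the compensated one.
        System.out.println("naive = " + naive);
        System.out.println("kahan = " + kahan.value());
    }
}
```

    The cost is a few extra floating-point operations per element, which
    keeps the computation single pass.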

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tlisonbee/flink FLINK-3664

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1855
    
----
commit 65f54df532829994a8be240a27b9138d01a186b5
Author: Todd Lisonbee <[email protected]>
Date:   2016-04-05T05:51:12Z

    [FLINK-3664] Create DataSetUtils method to easily summarize a DataSet of
    Tuples

----


> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
>                 Key: FLINK-3664
>                 URL: https://issues.apache.org/jira/browse/FLINK-3664
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>         Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single-pass statistics for all
>  * columns.
>  */
> public Tuple summarize()
>
> // Example usage:
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary =
>     input.summarize();
> summary.getField(0).stddev();
> summary.getField(1).maxStringLength();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
