[jira] [Commented] (FLINK-3664) Create a method to easily Summarize a DataSet

Todd Lisonbee (JIRA) Thu, 24 Mar 2016 08:56:10 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210409#comment-15210409
 ]


Todd Lisonbee commented on FLINK-3664:
--------------------------------------

Hi Fabian, thanks for the feedback.

Your first 3 comments all make sense - agreed.

On distinct counts, I thought about it but wasn't sure, so I left it out for 
now.  For an approximation, the best idea I had was to choose some arbitrary 
cap, maybe 100, and then report the exact number of distinct values if there 
are fewer than 100, or "100+" otherwise.  This would be nice for categorical 
variables that happen to have fewer than 100 different values.  But with 
enough rows and columns it could be expensive (even if Tuple is currently 
limited to 22), or at least relatively more expensive than the other 
calculations.  There isn't a perfect magic number, so I wasn't entirely happy 
with this idea.
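
To make the capped idea concrete, here is a minimal sketch in Java.  The class 
name CappedDistinctCounter and its methods are hypothetical and not part of 
Flink.  The sketch assumes that exactly `cap` distinct values still count as 
exact, and that anything beyond that is reported as "cap+".  It keeps at most 
cap + 1 values in memory per column, and merge() shows one possible way to 
combine partial results from parallel partitions.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not a Flink class): counts distinct values exactly
 * until more than `cap` distinct values have been seen, then gives up.
 */
class CappedDistinctCounter<T> {
    private final int cap;
    private final Set<T> seen = new HashSet<>();
    private boolean overflowed = false;

    CappedDistinctCounter(int cap) {
        if (cap <= 0) {
            throw new IllegalArgumentException("cap must be positive");
        }
        this.cap = cap;
    }

    /** Record one value from the column. */
    void add(T value) {
        if (overflowed) {
            return; // already past the cap, nothing more to track
        }
        seen.add(value);
        if (seen.size() > cap) {
            overflowed = true;
            seen.clear(); // release memory, the exact set is no longer needed
        }
    }

    /** Fold in a partial result; assumes both counters use the same cap. */
    void merge(CappedDistinctCounter<T> other) {
        if (other.overflowed) {
            overflowed = true;
            seen.clear();
            return;
        }
        for (T value : other.seen) {
            add(value);
        }
    }

    /** Exact count as a string, or "<cap>+" once the cap was exceeded. */
    String report() {
        return overflowed ? cap + "+" : String.valueOf(seen.size());
    }
}
```

With a cap of 100, a categorical column with 7 distinct values would report 
"7", while a column with more than 100 distinct values would report "100+".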

Do you know of a nice way to approximate distinct counts?

Thanks.

> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
>                 Key: FLINK-3664
>                 URL: https://issues.apache.org/jira/browse/FLINK-3664
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>         Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single-pass statistics for all columns.
>  */
> public Tuple summarize()
>
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize();
> // Typed field access (f0, f1) keeps the column summary type; getField(0) would need a cast
> summary.f0.stddev();
> summary.f1.maxStringLength();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
