[ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210409#comment-15210409 ]
Todd Lisonbee commented on FLINK-3664:
--------------------------------------
Hi Fabian, thanks for the feedback.
Your first 3 comments all make sense - agreed.
On distinct counts, I thought about it but wasn't sure, so I left it out for
now. For an approximation, the best idea I had was to choose some arbitrary
cap, maybe 100, and then report the exact number of distinct values if there
are fewer than 100, or "100+" if there are more than 100. This would be nice
for categorical variables that happen to have fewer than 100 different values.
But with enough rows and columns it could be expensive (even if Tuple is
currently limited to 22), or at least relatively more expensive than the other
calculations. There isn't a perfect magic number, so I wasn't fully happy with
this idea.
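A minimal sketch of the capped-count idea, to make the cost concrete. The class name `CappedDistinctCounter`, its methods, and the `merge` step are all hypothetical and not part of Flink; I'm also assuming that exactly `cap` distinct values still counts as exact and only more than `cap` is reported as "cap+":

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not a Flink class): counts distinct values exactly
 * up to a fixed cap, then gives up and only remembers that the cap was exceeded.
 */
final class CappedDistinctCounter<T> {
    private final int cap;
    private final Set<T> seen = new HashSet<>();
    private boolean overflowed = false;

    CappedDistinctCounter(int cap) {
        if (cap <= 0) {
            throw new IllegalArgumentException("cap must be positive");
        }
        this.cap = cap;
    }

    /** Record one value; after overflow this is a cheap no-op. */
    void add(T value) {
        if (overflowed) {
            return;
        }
        seen.add(value);
        checkOverflow();
    }

    /** Combine a partial result from another partition into this one. */
    void merge(CappedDistinctCounter<T> other) {
        if (overflowed) {
            return;
        }
        if (other.overflowed) {
            overflowed = true;
            seen.clear();
            return;
        }
        seen.addAll(other.seen);
        checkOverflow();
    }

    /** Exact count as a string, or "cap+" once more than cap values were seen. */
    String report() {
        return overflowed ? cap + "+" : String.valueOf(seen.size());
    }

    private void checkOverflow() {
        if (seen.size() > cap) {
            overflowed = true;
            seen.clear(); // free the memory, the exact values no longer matter
        }
    }
}
```

With this shape, memory per column is bounded by roughly `cap` values, but every row still pays a hash-set insert per column until that column overflows, which is the extra cost compared to the other statistics.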
Do you know of a nice way to approximate distinct counts?
Thanks.
> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
> Key: FLINK-3664
> URL: https://issues.apache.org/jira/browse/FLINK-3664
> Project: Flink
> Issue Type: Improvement
> Reporter: Todd Lisonbee
> Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single-pass statistics for all
>  * columns.
>  */
> public Tuple summarize()
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary
>     = input.summarize();
> summary.getField(0).stddev();
> summary.getField(1).maxStringLength();
> {code}