[ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212508#comment-15212508 ]
Fabian Hueske commented on FLINK-3664:
--------------------------------------
The number of distinct values can be approximated with the HyperLogLog
algorithm.
I think a first version is fine without distinct value counts though. The
approach sketched in the design doc looks quite extensible so that distinct
counts and possibly other metrics can be added later.
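
For reference, the HyperLogLog idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the design doc or from Flink: the class name HyperLogLogSketch, the choice of a SplitMix64-style mixing function as the hash, and the restriction to long inputs are all hypothetical. Other column types would first need a good 64-bit hash.

```java
/** Minimal HyperLogLog sketch (hypothetical class, not part of Flink). */
final class HyperLogLogSketch {
    private final int p;             // number of bits used to pick a register
    private final int m;             // number of registers, m = 2^p
    private final byte[] registers;  // max observed rank per register

    HyperLogLogSketch(int p) {
        if (p < 4 || p > 16) {
            throw new IllegalArgumentException("p must be in [4, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    /** Assumed hash: the SplitMix64 finalizer, used here only as a 64-bit mixer. */
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Record one value; duplicates never change the registers. */
    void add(long value) {
        long h = mix(value + 0x9e3779b97f4a7c15L);
        int idx = (int) (h >>> (64 - p));  // top p bits select the register
        long rest = h << p;                // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Approximate number of distinct values added so far. */
    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double raw = alpha(m) * m * (double) m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    /** Bias-correction constant from the HyperLogLog paper. */
    private static double alpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1.0 + 1.079 / m);
        }
    }
}
```

With p = 14 the sketch uses 16384 one-byte registers and the typical relative error is around 1%. Two sketches with the same p could be combined by taking the register-wise maximum, which is what would make this usable as a per-partition aggregate.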
> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
> Key: FLINK-3664
> URL: https://issues.apache.org/jira/browse/FLINK-3664
> Project: Flink
> Issue Type: Improvement
> Reporter: Todd Lisonbee
> Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single-pass statistics for all columns
>  */
> public Tuple summarize()
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize()
> summary.getField(0).stddev()
> summary.getField(1).maxStringLength()
> {code}
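
As an illustration of what "single pass statistics" could mean for the stddev() call in the quoted example, here is a minimal sketch using Welford's online algorithm. This is an assumption about one possible approach, not the implementation from the design doc: the class name RunningStats and its methods are hypothetical, and it reports the sample (n - 1) standard deviation, which may differ from what the proposed DoubleColumnSummary would return.

```java
/** Single-pass mean/variance accumulator (hypothetical, not part of Flink). */
final class RunningStats {
    private long count;
    private double mean;
    private double m2;  // sum of squared deviations from the current mean

    /** Welford's update: numerically stable, one pass, O(1) memory. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Combine with the statistics of another partition. */
    void merge(RunningStats other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        // m2 must be updated with the old counts, before count is overwritten.
        m2 += other.m2 + delta * delta * count * other.count / total;
        mean += delta * other.count / total;
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; NaN when fewer than two values were seen. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    double stddev() {
        return Math.sqrt(variance());
    }
}
```

The merge method is what would let each parallel partition summarize its own data and then combine the partial results, so the whole DataSet is still read only once.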