[
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626360#comment-14626360
]
ASF GitHub Bot commented on FLINK-2148:
---------------------------------------
GitHub user ggevay opened a pull request:
https://github.com/apache/flink/pull/910
[FLINK-2148] [contrib] Exact and approximate countDistinct on streams
For the approximate calculation I used the HyperLogLog implementation in
the Clearspring library.
Currently countDistinct operates only on the entire stream, but once the
windowing rewrite settles, I will modify this code to work on windows
instead. The Clearspring implementation has a merge method, so it is
compatible with windowing aggregation optimizations such as panes or B-Int.
I have also added a fromArray convenience method to
StreamExecutionEnvironment.
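To illustrate why a merge method matters for pane-style window aggregation,
here is a minimal HyperLogLog sketch in plain Java. It is not the Clearspring
implementation and not the code in this pull request: the class name
SimpleHll, the SplitMix64-style hash, and the choice of p = 12 are assumptions
made for the example. The point is that merging two sketches is a
register-wise maximum, so per-pane sketches can be combined into a window
result without revisiting the elements.

```java
/**
 * Minimal HyperLogLog sketch, for illustration only.
 * NOT the Clearspring implementation; all names here are hypothetical.
 */
final class SimpleHll {
    private final int p;            // number of index bits
    private final byte[] registers; // 2^p registers, each holding a maximum rank

    SimpleHll(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.registers = new byte[1 << p];
    }

    // SplitMix64 step, used here as a stand-in 64-bit hash (an assumption).
    private static long hash(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void offer(long value) {
        long h = hash(value);
        int idx = (int) (h >>> (64 - p)); // top p bits pick the register
        long rest = h << p;               // remaining bits, left-aligned
        // Rank = position of the first 1-bit, capped if all remaining bits are 0.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merging is a register-wise maximum, so the result equals the sketch
    // that would have been built from the union of both inputs.
    void merge(SimpleHll other) {
        if (other.p != p) {
            throw new IllegalArgumentException("precision mismatch");
        }
        for (int i = 0; i < registers.length; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    long cardinality() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r); // 2^(-r)
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            // Small-range correction (linear counting).
            estimate = m * Math.log((double) m / zeros);
        }
        return Math.round(estimate);
    }

    public static void main(String[] args) {
        // Two overlapping "panes"; the true distinct count of the union is 75,000.
        SimpleHll pane1 = new SimpleHll(12);
        SimpleHll pane2 = new SimpleHll(12);
        for (long i = 0; i < 50_000; i++) pane1.offer(i);
        for (long i = 25_000; i < 75_000; i++) pane2.offer(i);
        pane1.merge(pane2);
        System.out.println("estimated distinct count: " + pane1.cardinality());
    }
}
```

With p = 12 the sketch uses 4096 registers, and the standard error of
HyperLogLog is roughly 1.04 / sqrt(4096), about 1.6%, so the merged estimate
should land close to 75,000. A production version would hash arbitrary objects
and tune the precision rather than fix it as done here.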
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ggevay/flink countDistinct
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/910.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #910
----
commit f0ae6a06daec7a016f756f9db436b2e1e56a3c46
Author: Gabor Gevay
Date: 2015-07-13T19:59:10Z
[FLINK-2148] [contrib] Exact and approximate countDistinct on streams
----
> Approximately calculate the number of distinct elements of a stream
> -------------------------------------------------------------------
>
> Key: FLINK-2148
> URL: https://issues.apache.org/jira/browse/FLINK-2148
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming
> Reporter: Gabor Gevay
> Assignee: Gabor Gevay
> Priority: Minor
> Labels: statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describe an optimal algorithm for estimating the number of
> distinct elements in a data stream.