[ 
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626360#comment-14626360
 ] 

ASF GitHub Bot commented on FLINK-2148:
---------------------------------------

GitHub user ggevay opened a pull request:

    https://github.com/apache/flink/pull/910

    [FLINK-2148] [contrib] Exact and approximate countDistinct on streams

    For the approximate calculation I used the HyperLogLog implementation in
    the Clearspring library.
    
    Currently it operates only on the entire stream, but when the dust settles
    around the windowing rewrite, I will modify this code to work on windows
    instead. The Clearspring implementation has a merge method, which means
    that it is compatible with windowing aggregation optimizations like panes
    or B-Int.
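
To illustrate why a merge method matters for pane-based windowing, here is a minimal, self-contained sketch. It is not the Clearspring implementation and not the code in this pull request: the class HllSketch, its methods offer, merge and cardinality, the hash function, and the pane data are all hypothetical and chosen only for illustration. The idea is that each pane keeps its own small sketch, and a window estimate is obtained by merging pane sketches (a register-wise maximum) rather than re-reading the elements.

```java
/** Toy HyperLogLog; illustrative only, NOT the Clearspring class used in the PR. */
final class HllSketch {
    private final int p;            // number of hash bits used to pick a register
    private final byte[] registers; // 2^p registers, each holding a maximum rank

    HllSketch(int p) {
        if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** SplitMix64-style mixer, used here as an assumed stand-in for a real hash. */
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void offer(long value) {
        long h = mix(value);
        int idx = (int) (h >>> (64 - p));   // top p bits select the register
        long rest = h << p;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    /** Register-wise maximum: the result equals a sketch fed the union of both inputs. */
    void merge(HllSketch other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < registers.length; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    long cardinality() {
        int m = registers.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);   // bias constant for m >= 128
        double est = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (est <= 2.5 * m && zeros > 0) est = m * Math.log((double) m / zeros);
        return Math.round(est);
    }

    public static void main(String[] args) {
        int paneCount = 4;
        HllSketch[] panes = new HllSketch[paneCount];
        for (int k = 0; k < paneCount; k++) {
            panes[k] = new HllSketch(12);
            // Hypothetical data: pane k sees ids [30000k, 30000k + 50000).
            for (long id = 30_000L * k; id < 30_000L * k + 50_000; id++) panes[k].offer(id);
        }
        // Sliding window of two panes: merge pane sketches instead of re-reading elements.
        for (int k = 0; k + 1 < paneCount; k++) {
            HllSketch window = new HllSketch(12);
            window.merge(panes[k]);
            window.merge(panes[k + 1]);
            System.out.println("window " + k + ": ~" + window.cardinality()
                    + " distinct (exact: 80000)");
        }
    }
}
```

Because merging is a register-wise maximum, a merged sketch is identical to one built over the combined input, so each pane is summarized once and reused by every window that contains it. An exact count (for example, a hash set per pane) can be combined the same way by set union, at the cost of memory proportional to the number of distinct elements.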
    
    I have also added a fromArray convenience method to
    StreamExecutionEnvironment.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ggevay/flink countDistinct

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/910.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #910
    
----
commit f0ae6a06daec7a016f756f9db436b2e1e56a3c46
Author: Gabor Gevay
Date:   2015-07-13T19:59:10Z

    [FLINK-2148] [contrib] Exact and approximate countDistinct on streams

----


> Approximately calculate the number of distinct elements of a stream
> -------------------------------------------------------------------
>
>                 Key: FLINK-2148
>                 URL: https://issues.apache.org/jira/browse/FLINK-2148
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Minor
>              Labels: statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describe an optimal algorithm for estimating the number of
> distinct elements in a data stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
