[jira] [Commented] (FLINK-2148) Approximately calculate the number of distinct elements of a stream

ASF GitHub Bot (JIRA) Wed, 15 Jul 2015 09:35:28 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628321#comment-14628321
 ]


ASF GitHub Bot commented on FLINK-2148:
---------------------------------------

Github user ggevay commented on the pull request:

    https://github.com/apache/flink/pull/910#issuecomment-121671071
  
    Oh, I see. :) I'm actually glad that you brought this up, because I was not 
really sure how this should be done. At first I also thought to just return a 
`DataStream[Long]`, but then it occurred to me that if someone wants to do 
something with the result that also involves the original data, then it gets 
kind of awkward to reconnect them. Or would they just do a `CoMap` in that 
case? I don't know...
    
    Note that the newly added overload that works on the entire record 
does return a `DataStream[Long]`, so the user also has the option to 
project to one field and then call that overload to get a stream of Longs.
    
    By the way, I think one could also ask the same question about 
`WindowedDataStream.sum` (and similar methods): why does that do its thing on 
one field rather than working with the entire record? Could you give some 
details of that design decision, and how it would or would not apply here?


> Approximately calculate the number of distinct elements of a stream
> -------------------------------------------------------------------
>
>                 Key: FLINK-2148
>                 URL: https://issues.apache.org/jira/browse/FLINK-2148
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Minor
>              Labels: statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describe an optimal algorithm for estimating the number of 
> distinct elements in a data stream.
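
For readers unfamiliar with distinct-count estimation, the following is a
minimal sketch of the general idea in plain Java. It is NOT the Kane et al.
algorithm from the paper above, and it is not the code from the pull request;
it uses the much simpler K-Minimum-Values (KMV) technique instead. The class
name `KmvDistinctEstimator`, the parameter `k`, and the choice of hash mixer
are all assumptions made for illustration only.

```java
import java.util.TreeSet;

/**
 * Illustrative K-Minimum-Values sketch (hypothetical class, not Flink API).
 * It remembers only the k smallest distinct 64-bit hash values seen so far.
 */
final class KmvDistinctEstimator {
    private final int k;
    // Hashes are ordered as unsigned 64-bit integers.
    private final TreeSet<Long> smallest = new TreeSet<>(Long::compareUnsigned);

    KmvDistinctEstimator(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        this.k = k;
    }

    // SplitMix64-style finalizer: a bijective mixer on 64-bit values.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Feeds one stream element (represented here as a long key). */
    void add(long element) {
        long h = mix(element + 0x9e3779b97f4a7c15L);
        if (smallest.size() < k) {
            smallest.add(h); // duplicates are ignored by the set
        } else if (Long.compareUnsigned(h, smallest.last()) < 0 && smallest.add(h)) {
            smallest.pollLast(); // evict the largest to keep exactly k hashes
        }
    }

    /** Returns the estimated number of distinct elements added so far. */
    long estimate() {
        if (smallest.size() < k) {
            return smallest.size(); // fewer than k distinct hashes: count them
        }
        // Map the k-th smallest hash into (0, 1] using its top 53 bits.
        double kth = ((smallest.last() >>> 11) + 1) * 0x1.0p-53;
        return Math.round((k - 1) / kth);
    }
}
```

The sketch uses memory proportional to `k` regardless of stream length, and
a larger `k` trades memory for a tighter estimate. In a streaming setting the
estimator would be updated per record and its current estimate emitted, which
is where the question of returning a plain stream of Longs versus keeping the
original records comes in.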



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
