[
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628321#comment-14628321
]
ASF GitHub Bot commented on FLINK-2148:
---------------------------------------
Github user ggevay commented on the pull request:
https://github.com/apache/flink/pull/910#issuecomment-121671071
Oh, I see. :) I'm actually glad that you brought this up, because I was not
really sure how this should be done. At first I also thought to just return a
`DataStream[Long]`, but then it occurred to me that if someone wants to do
something with the result that also involves the original data, then it gets
kind of awkward to reconnect them. Or would they just do a `CoMap` in that
case? I don't know...
Note that the newly added overload that works with the entire record
actually does return a `DataStream[Long]`, so the user also has the option to
project to one field and then call that overload to get a stream of Longs.
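
To make the two return shapes concrete, here is a minimal sketch. It is not the
code in the pull request: a plain `List` stands in for the stream, an exact
`Set`-based count stands in for the approximate estimator, and the names
`DistinctCountShapes`, `countDistinct` and `countDistinctBy` are hypothetical.

```scala
// Hypothetical stand-in, not Flink API: a finite List plays the role of a
// stream, and an exact Set-based count replaces the approximate estimator.
object DistinctCountShapes {

  // Shape 1: works on entire records and emits only the running count.
  def countDistinct[T](in: List[T]): List[Long] =
    in.scanLeft(Set.empty[T])(_ + _) // set of elements seen so far, per prefix
      .tail                          // drop the initial empty set
      .map(_.size.toLong)

  // Shape 2: counts distinct values of one field, but keeps each original
  // record next to the running count, so nothing has to be reconnected later.
  def countDistinctBy[T, K](in: List[T])(field: T => K): List[(T, Long)] =
    in.zip(countDistinct(in.map(field)))

  def main(args: Array[String]): Unit = {
    val records = List(("a", 1), ("b", 1), ("a", 2), ("c", 1))

    // Project to one field, then call the whole-record overload.
    println(countDistinct(records.map(_._1))) // List(1, 2, 2, 3)

    // Count on the first field while keeping the original records.
    println(countDistinctBy(records)(_._1))
  }
}
```

With the first shape the caller has to join the counts back to the records if
they are needed together; the second shape avoids that at the cost of a wider
output type.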
By the way, I think one could also ask the same question about
`WindowedDataStream.sum` (and similar methods): why does that operate on
one field rather than on the entire records? Could you give some
details on that design decision, and how it would or would not apply here?
> Approximately calculate the number of distinct elements of a stream
> -------------------------------------------------------------------
>
> Key: FLINK-2148
> URL: https://issues.apache.org/jira/browse/FLINK-2148
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming
> Reporter: Gabor Gevay
> Assignee: Gabor Gevay
> Priority: Minor
> Labels: statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describe an optimal algorithm for estimating the number of
> distinct elements in a data stream.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)