[ https://issues.apache.org/jira/browse/FLINK-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700574#comment-16700574 ]
vinoyang commented on FLINK-10993:
----------------------------------
[~StephanEwen] Yes, I believe it is not just about the state. Fabian also
suggested that we find at least one committer to lead this work.
I think a BloomFilter should work well for both streaming and batch jobs,
which fits Flink's goal of unifying stream and batch processing. I would like
to introduce this feature because at Tencent several business teams need to
"exclude duplicates" from massive streaming data, and they do not want to
bring in an external system such as Redis.
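To make the "exclude duplicates" use case concrete, here is a minimal sketch of Bloom-filter-based deduplication in plain Java. Everything in it is an assumption for illustration: the class name SimpleBloomFilter, the dedup helper, and the FNV-1a double-hashing scheme are hypothetical and are neither Flink's internal BloomFilter nor a proposed public API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical standalone sketch; this is not Flink's internal BloomFilter class. */
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int expectedItems, double fpp) {
        if (expectedItems <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedItems > 0 and 0 < fpp < 1");
        }
        double ln2 = Math.log(2);
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        this.numBits = Math.max(1, (int) Math.ceil(-expectedItems * Math.log(fpp) / (ln2 * ln2)));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    // Seeded 32-bit FNV-1a; chosen only to keep the sketch dependency-free.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is (h1 + i * h2) mod numBits.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Adds the item; returns true if at least one bit flipped, i.e. the item was definitely new. */
    public boolean put(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = index(h1, h2, i);
            if (!bits.get(idx)) {
                bits.set(idx);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns false only if the item was never added; true may be a false positive. */
    public boolean mightContain(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    /** Keeps items the filter has definitely not seen; a false positive may drop a new item. */
    public static List<String> dedup(Iterable<String> items, int expectedItems, double fpp) {
        SimpleBloomFilter seen = new SimpleBloomFilter(expectedItems, fpp);
        List<String> out = new ArrayList<>();
        for (String item : items) {
            if (seen.put(item)) {
                out.add(item);
            }
        }
        return out;
    }
}
```

The trade-off is that a repeated item is never emitted twice, but an occasional first-seen item can be dropped at roughly the configured false-positive rate. A real DataStream operator would also have to keep the bit set in checkpointed state, which this sketch ignores.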
I recommend splitting the work into three subtasks:
1) Define the BloomFilter API for DataStream/DataSet
2) Implement it for the DataStream API
3) Implement it for the DataSet API
What do you think? cc [~fhueske]
> Bring bloomfilter as a public API
> ---------------------------------
>
> Key: FLINK-10993
> URL: https://issues.apache.org/jira/browse/FLINK-10993
> Project: Flink
> Issue Type: New Feature
> Components: DataStream API
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
>
> Flink internally provides an implementation of BloomFilter, but only for
> internal optimization, and does not provide APIs for public access.
> Here is an earlier discussion on the user mailing list:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Bloom-filter-in-Flink-td10608.html
> Considering that many users have the need to "determine duplicates" in
> streaming computing, I think it would make sense to provide such an API.
> In addition, Spark already provides BloomFilter as a public API:
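> {code:scala}
> // Build a Bloom filter over column "dd": dataLen expected items, 1% false-positive rate
> val bf = df.stat.bloomFilter("dd", dataLen, 0.01)
> // Pair each element with whether the filter might contain it
> val rightNum = rdd.map(x => (x.toInt, bf.mightContainString(x)))
> {code}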
>
>