[ https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356645#comment-16356645 ]
Fabian Hueske commented on FLINK-8601: -------------------------------------- Hi [~sihuazhou], thanks for opening this Jira and preparing a design document. I see that the status of the issue is "In Progress" so I assume that you are already working on this feature. Has there been a discussion about adding a new state primitive that I missed? I'm asking because adding a new state primitive means that it has to be supported by all state backends which is a major implementation effort. At the same time, the proposed state type seems to address only very specific use cases. Before starting to work on this, I would get the opinion of other developers whether this is a feature that we'd like to add to Flink or if there is another way to solve the problem. [~stefanrichte...@gmail.com], [~aljoscha] What do you think? Thanks, Fabian > Introduce LinkedBloomFilterState for Approximate calculation and other > situations of performance optimization > ------------------------------------------------------------------------------------------------------------- > > Key: FLINK-8601 > URL: https://issues.apache.org/jira/browse/FLINK-8601 > Project: Flink > Issue Type: New Feature > Components: Core, DataStream API > Affects Versions: 1.4.0 > Reporter: Sihua Zhou > Assignee: Sihua Zhou > Priority: Major > > h3. Backgroud > Bloom filter is useful in many situation, for example: > * 1. Approximate calculation: deduplication (eg: UV calculation) > * 2. Performance optimization: eg, [runtime filter > join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html] > However, based on the current status provided by flink, it is hard to use the > bloom filter for the following reasons: > * 1. Serialization problem: Bloom filter status can be large (for example: > 100M), if implement it based on the RocksDB state, the state data will need > to be serialized each time it is queried and updated, and the performance > will be very poor. > * 2. Data skewed: Data in different key group can be skewed, and the > information of data skewed can not be accurately predicted before the program > is running. Therefore, it is impossible to determine how much resources bloom > filter should allocate. One way to do this is to allocate space needed for > the most skewed case, but this can lead to very serious waste of resources. > h3. Requirement > Therefore, I introduce the LinkedBloomFilterState for flink, which at least > need to meet the following features: > * 1. Support for changing Parallelism > * 2. Only serialize when necessary: when performing checkpoint > * 3. Can deal with data skew problem: users only need to specify a > LinkedBloomFilterState with the desired input, fpp, system will allocate > resource dynamic. > * 4. Do not conflict with other state: user can use KeyedState and > OperateState when using bloom filter state. > * 5. Support relax ttl (ie: the data survival time at least greater than the > specified time) > Design doc: [design > doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005)