[
https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sihua Zhou updated FLINK-8601:
------------------------------
Affects Version/s: (was: 1.4.0)
1.5.0
> Introduce PartitionedBloomFilter for Approximate calculation and other
> situations of performance optimization
> -------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-8601
> URL: https://issues.apache.org/jira/browse/FLINK-8601
> Project: Flink
> Issue Type: New Feature
> Components: DataStream API, State Backends, Checkpointing
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Assignee: Sihua Zhou
> Priority: Major
>
> h3. Backgroud
> Bloom filter is useful in many situation, for example:
> * 1. Approximate calculation: deduplication (eg: UV calculation)
> * 2. Performance optimization: eg, [runtime filter
> join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
> By using BF, we can greatly reduce the number of queries for state
> data in a stream join, and these filtered queries will eventually fail to
> find any results, which is a poor performance for rocksdb-based state due to
> traversing ```sst``` on the disk.
> However, based on the current status provided by flink, it is hard to use the
> bloom filter for the following reasons:
> * 1. Serialization problem: Bloom filter status can be large (for example:
> 100M), if implement it based on the RocksDB state, the state data will need
> to be serialized each time it is queried and updated, and the performance
> will be very poor.
> * 2. Data skewed: Data in different key group can be skewed, and the
> information of data skewed can not be accurately predicted before the program
> is running. Therefore, it is impossible to determine how much resources bloom
> filter should allocate. One way to do this is to allocate space needed for
> the most skewed case, but this can lead to very serious waste of resources.
> h3. Requirement
> Therefore, I introduce the PartitionedBloomFilter for flink, which at least
> need to meet the following features:
> * 1. Support for changing Parallelism
> * 2. Only serialize when necessary: when performing checkpoint
> * 3. Can deal with data skew problem: users only need to specify a
> PartitionedBloomFilter with the desired input, fpp, system will allocate
> resource dynamic.
> * 4. Do not conflict with other state: user can use KeyedState and
> OperateState when using this bloom filter.
> * 5. Support relax ttl (ie: the data survival time at least greater than the
> specified time)
> Design doc: [design
> doc|https://docs.google.com/document/d/1s8w2dkNFDM9Fb2zoHwHY0hJRrqatAFta42T97nDXmqc/edit?usp=sharing]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)