[ 
https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sihua Zhou updated FLINK-8601:
------------------------------
    Description: 
h3. Backgroud

Bloom filter is useful in many situation, for example:
 * 1. Approximate calculation: deduplication (eg: UV calculation)
 * 2. Performance optimization: eg, [runtime filter 
join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
       By using BF, we can greatly reduce the number of queries for state data 
in a stream join, and these filtered queries will eventually fail to find any 
results, which is a poor performance for rocksdb-based state due to traversing 
```sst``` on the disk. 

However, based on the current status provided by flink, it is hard to use the 
bloom filter for the following reasons:
 * 1. Serialization problem: Bloom filter status can be large (for example: 
100M), if implement it based on the RocksDB state, the state data will need to 
be serialized each time it is queried and updated, and the performance will be 
very poor.
 * 2. Data skewed: Data in different key group can be skewed, and the 
information of data skewed can not be accurately predicted before the program 
is running. Therefore, it is impossible to determine how much resources bloom 
filter should allocate. One way to do this is to allocate space needed for the 
most skewed case, but this can lead to very serious waste of resources.

h3. Requirement

Therefore, I introduce the LinkedBloomFilter for flink, which at least need to 
meet the following features:
 * 1. Support for changing Parallelism
 * 2. Only serialize when necessary: when performing checkpoint
 * 3. Can deal with data skew problem: users only need to specify a 
LinkedBloomFilterState with the desired input, fpp, system will allocate 
resource dynamic.
 * 4. Do not conflict with other state: user can use KeyedState and 
OperateState when using bloom filter state.
 * 5. Support relax ttl (ie: the data survival time at least greater than the 
specified time)

Design doc:  [design 
doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]

  was:
h3. Backgroud

Bloom filter is useful in many situation, for example:
 * 1. Approximate calculation: deduplication (eg: UV calculation)
 * 2. Performance optimization: eg, [runtime filter 
join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]

However, based on the current status provided by flink, it is hard to use the 
bloom filter for the following reasons:
 * 1. Serialization problem: Bloom filter status can be large (for example: 
100M), if implement it based on the RocksDB state, the state data will need to 
be serialized each time it is queried and updated, and the performance will be 
very poor.
 * 2. Data skewed: Data in different key group can be skewed, and the 
information of data skewed can not be accurately predicted before the program 
is running. Therefore, it is impossible to determine how much resources bloom 
filter should allocate. One way to do this is to allocate space needed for the 
most skewed case, but this can lead to very serious waste of resources.

h3. Requirement

Therefore, I introduce the LinkedBloomFilterState for flink, which at least 
need to meet the following features:
 * 1. Support for changing Parallelism
 * 2. Only serialize when necessary: when performing checkpoint
 * 3. Can deal with data skew problem: users only need to specify a 
LinkedBloomFilterState with the desired input, fpp, system will allocate 
resource dynamic.
 * 4. Do not conflict with other state: user can use KeyedState and 
OperateState when using bloom filter state.
 * 5. Support relax ttl (ie: the data survival time at least greater than the 
specified time)

Design doc:  [design 
doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]

        Summary: Introduce LinkedBloomFilter for Approximate calculation and 
other situations of performance optimization  (was: Introduce 
LinkedBloomFilterState for Approximate calculation and other situations of 
performance optimization)

> Introduce LinkedBloomFilter for Approximate calculation and other situations 
> of performance optimization
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8601
>                 URL: https://issues.apache.org/jira/browse/FLINK-8601
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, DataStream API
>    Affects Versions: 1.4.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
>
> h3. Backgroud
> Bloom filter is useful in many situation, for example:
>  * 1. Approximate calculation: deduplication (eg: UV calculation)
>  * 2. Performance optimization: eg, [runtime filter 
> join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
>        By using BF, we can greatly reduce the number of queries for state 
> data in a stream join, and these filtered queries will eventually fail to 
> find any results, which is a poor performance for rocksdb-based state due to 
> traversing ```sst``` on the disk. 
> However, based on the current status provided by flink, it is hard to use the 
> bloom filter for the following reasons:
>  * 1. Serialization problem: Bloom filter status can be large (for example: 
> 100M), if implement it based on the RocksDB state, the state data will need 
> to be serialized each time it is queried and updated, and the performance 
> will be very poor.
>  * 2. Data skewed: Data in different key group can be skewed, and the 
> information of data skewed can not be accurately predicted before the program 
> is running. Therefore, it is impossible to determine how much resources bloom 
> filter should allocate. One way to do this is to allocate space needed for 
> the most skewed case, but this can lead to very serious waste of resources.
> h3. Requirement
> Therefore, I introduce the LinkedBloomFilter for flink, which at least need 
> to meet the following features:
>  * 1. Support for changing Parallelism
>  * 2. Only serialize when necessary: when performing checkpoint
>  * 3. Can deal with data skew problem: users only need to specify a 
> LinkedBloomFilterState with the desired input, fpp, system will allocate 
> resource dynamic.
>  * 4. Do not conflict with other state: user can use KeyedState and 
> OperateState when using bloom filter state.
>  * 5. Support relax ttl (ie: the data survival time at least greater than the 
> specified time)
> Design doc:  [design 
> doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to