GitHub user mapleFU edited a discussion: Talk about Metadata for BloomFilter
To be briefly, we're plan to support RedisBloom[1]. We plan to use `BlockSplitBloomFilter` as the underlying BloomFilter[2], because it's widely used by database by database systems like RocksDB[3], Impala, Kudu and other systems, it was merged in kvrocks. Now we plan to support RedisBloom, the original redis bloom has two-level: ``` SBChain: a chain of in-memory bloom filter Bloom: BloomFilter implemention ``` We can easily support this, however, we need to design our metadata carefully. ## Draft1 Currently, the draft is: ``` ChainMetadata: (num_filters, scaling, expansion) BFMetadata: (num_distinct, size, false-positive-rate) ``` For `ChainMetadata`: 1. `num_filters` is the number of underlying BFMetadata. It would be initialized as "1". And it can grow when "scaling" enabled. 2. `scaling` and `expansion`: Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead the error rate starts to grow. To keep the error close to the one set on filter initialisation - the bloom filter will auto-scale, meaning when capacity is reached an additional sub-filter will be created. The size of the new sub-filter is the size of the last sub-filter multiplied by EXPANSION. The default expansion value is 2. For `BFMetadata`: 1. `num_distinct`: the distinct value in this bloomFilter 2. `size`: the size of bytes for bloom filter 3. `false-positive-rate`: fpp for filter. During reading: 1. Get `ChainMetadata` 2. Get all `BFMetadata` 3. searching from largest to smallest ( this keeps same as RedisBloom, personally I think we can search from lowest to highest) During writing: 1. Get `ChainMetadata` 2. Get all `BFMetadata` 3. searching from largest to smallest ( this keeps same as RedisBloom, personally I think we can search from lowest to highest), if exists, do nothing 4. Else, try to insert to largest filter. If last filter is not full, we can insert to last one. Else we need to allocate a new filter. ### Cons Need two "metadata", which might a bit complex. And during "add", they should all be updated. ### Pros Flexible, we can separate the bf and bf-chain metadata. ## Draft2 ``` ChainMetadata: (num_filters, scaling, expansion, base_size, sum_num_distinct, false-positive-rate) ``` Other arguments is same as previous one. However, we can deduce num of each filter by `base_size` and `(scaling, expansion)`. This avoid a `BFMetadata` ### Cons Not so easy to evolution the underlying BF format. ### Pros Only one metadata, is convinient ## References [1] https://redis.io/docs/stack/bloom/ [2] https://github.com/apache/parquet-format/blob/master/BloomFilter.md [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#full-filters-new-format GitHub link: https://github.com/apache/kvrocks/discussions/1701 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
