[GitHub] [kvrocks] mapleFU edited a discussion: Talk about Metadata for BloomFilter

via GitHub Fri, 25 Aug 2023 00:44:36 -0700


GitHub user mapleFU edited a discussion: Talk about Metadata for BloomFilter


To be briefly, we're plan to support RedisBloom[1]. We plan to use 
`BlockSplitBloomFilter` as the underlying BloomFilter[2], because it's widely 
used by database by database systems like RocksDB[3], Impala, Kudu and other 
systems, it was merged in kvrocks.

Now we plan to support RedisBloom, the original redis bloom has two-level:

```
SBChain: a chain of in-memory bloom filter
Bloom: BloomFilter implemention
```

We can easily support this, however, we need to design our metadata carefully.

## Draft1

Currently, the draft is:

```
ChainMetadata: (num_filters, scaling, expansion)
BFMetadata: (num_distinct, size, false-positive-rate)
```

For `ChainMetadata`:
1. `num_filters` is the number of underlying BFMetadata. It would be 
initialized as "1". And it can grow when "scaling" enabled.
2. `scaling` and `expansion`: Adding an element to a Bloom filter never fails 
due to the data structure "filling up". Instead the error rate starts to grow. 
To keep the error close to the one set on filter initialisation - the bloom 
filter will auto-scale, meaning when capacity is reached an additional 
sub-filter will be created. The size of the new sub-filter is the size of the 
last sub-filter multiplied by EXPANSION. The default expansion value is 2.

For `BFMetadata`:
1. `num_distinct`: the distinct value in this bloomFilter
2. `size`: the size of bytes for bloom filter
3. `false-positive-rate`: fpp for filter.

During reading:
1. Get `ChainMetadata`
2. Get all `BFMetadata`
3. searching from largest to smallest ( this keeps same as RedisBloom, 
personally I think we can search from lowest to highest)

During writing:
1. Get `ChainMetadata`
2. Get all `BFMetadata`
3. searching from largest to smallest ( this keeps same as RedisBloom, 
personally I think we can search from lowest to highest), if exists, do nothing
4. Else, try to insert to largest filter. If last filter is not full, we can 
insert to last one. Else we need to allocate a new filter.

### Cons

Need two "metadata", which might a bit complex. And during "add", they should 
all be updated.

### Pros

Flexible, we can separate the bf and bf-chain metadata.


## Draft2

```
ChainMetadata: (num_filters, scaling, expansion, base_size, sum_num_distinct, 
false-positive-rate)
```

Other arguments is same as previous one. However, we can deduce num of each 
filter by `base_size` and `(scaling, expansion)`. This avoid a `BFMetadata`

### Cons

Not so easy to evolution the underlying BF format.

### Pros

Only one metadata, is convinient

## References

[1] https://redis.io/docs/stack/bloom/
[2] https://github.com/apache/parquet-format/blob/master/BloomFilter.md
[3] 
https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#full-filters-new-format

GitHub link: https://github.com/apache/kvrocks/discussions/1701

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

[GitHub] [kvrocks] mapleFU edited a discussion: Talk about Metadata for BloomFilter

Reply via email to