GitHub user LiuQhahah added a comment to the discussion: Cuckoo Filter Design Proposal
> For example, if initially we have 100 buckets in one rocksdb key, and the rehashing is triggered, we should then have 300 buckets. And in this case, we have multiple choices:
>
> - remain with just one rocksdb key, and put all 300 buckets into this key
> - split them into two rocksdb keys, i.e. 150 buckets in each key

I was influenced by RedisBloom, which uses multiple sub-filters to form a cuckoo filter. Personally, I find this approach quite clear. For ADD operations, if multiple attempts to insert an item fail, a resize is triggered, and the item is inserted into a new sub-filter. For queries, you simply iterate through the sub-filters one by one, starting from the latest.

A cuckoo filter finds the bucket index in O(1); checking whether that bucket has a free slot then takes O(n), where n is the bucket size. Since the bucket size ranges from 1 to 255, I believe this O(n) cost is acceptable, so the ADD operation remains relatively fast. The same logic applies to DELETE and QUERY operations.

Below is a diagram of the CF layout in RocksDB after three expansions, with a base capacity of 16, expansion=2, and bucket_size=2.

```
RocksDB Storage
===============

Metadata Entry
--------------
Key:   "__namespace__:cuckoo_filter:my_cf"
Value: { "n_filters": 4, "base_capacity": 16, "expansion": 2, ... }

Data Shard #0
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:0"
Value: (Binary data, 32 bytes for 16 buckets)

  +-----------+-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | Bucket 2  | ... | Bucket 15 |
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+
  | S0  | S1  | S0  | S1  | S0  | S1  | ... | S0  | S1  |  (S = Slot)
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+
  | fA  | fB  | fC  | fD  | fE  | fF  | ... | fX  | fY  |  (f = Fingerprint, 1 byte)
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+

Data Shard #1
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:1"
Value: (Binary data, 64 bytes for 32 buckets)

  +-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | ... | Bucket 31 |
  +-----+-----+-----+-----+ ... +-----+-----+
  | S0  | S1  | S0  | S1  | ... | S0  | S1  |
  +-----+-----+-----+-----+ ... +-----+-----+

Data Shard #2
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:2"
Value: (Binary data, 128 bytes for 64 buckets)

  ... (similar structure, but with 64 buckets) ...

Data Shard #3 (Empty)
---------------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:3"
Value: (Binary data, 256 bytes for 128 buckets, all bytes are 0x00)

  +-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | ... | Bucket 127|
  +-----+-----+-----+-----+ ... +-----+-----+
  | 00  | 00  | 00  | 00  | ... | 00  | 00  |  (All slots are empty/zero)
  +-----+-----+-----+-----+ ... +-----+-----+
```

In addition, I have a question I'd like to consult about. The RocksDB FAQ (https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ) recommends a maximum value size of 3GB. Under the current idea, after N expansions of the cuckoo filter, a chunking method needs to be adopted to split a single sub-filter into multiple key-value pairs of at most 3GB each. Is it necessary to adopt the 3GB threshold from the start? That is: while a CF sub-filter's capacity is less than 3GB, the sub-filter is stored as a single key-value pair; once a sub-filter's capacity exceeds 3GB after an expansion, it is represented by multiple 3GB key-value pairs.
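To make the chunking idea concrete, here is a minimal C++ sketch of how the sizes and keys could be computed. The `:<chunk>` key suffix, the struct and function names, and the 3GB constant are illustrative assumptions for this discussion, not part of the proposal above:

```cpp
#include <cstdint>
#include <string>

// Recommended maximum value size from the RocksDB FAQ.
constexpr uint64_t kMaxChunkBytes = 3ULL * 1024 * 1024 * 1024;

// Mirrors the metadata entry in the diagram above.
struct CuckooFilterMeta {
  uint32_t n_filters;      // number of sub-filters created so far
  uint64_t base_capacity;  // buckets in sub-filter #0
  uint32_t expansion;      // capacity growth factor per resize
  uint32_t bucket_size;    // slots per bucket (1-byte fingerprints)
};

// Bytes occupied by sub-filter `i`: the bucket count grows by `expansion`
// with every resize, and each slot holds one fingerprint byte.
uint64_t SubFilterBytes(const CuckooFilterMeta &meta, uint32_t i) {
  uint64_t buckets = meta.base_capacity;
  for (uint32_t j = 0; j < i; ++j) buckets *= meta.expansion;
  return buckets * meta.bucket_size;
}

// Number of <=3GB chunks needed for sub-filter `i`. This is 1 while the
// sub-filter still fits in a single value, so small filters keep the
// one-key-per-sub-filter form from the diagram.
uint64_t ChunkCount(const CuckooFilterMeta &meta, uint32_t i) {
  const uint64_t bytes = SubFilterBytes(meta, i);
  return (bytes + kMaxChunkBytes - 1) / kMaxChunkBytes;  // ceiling division
}

// Hypothetical chunked key: the diagram uses ":<sub_filter>" only; a
// ":<chunk>" suffix would be appended once a sub-filter is split, e.g.
// "__cf__:__namespace__:cuckoo_filter:my_cf:3:0".
std::string ChunkKey(const std::string &filter_key, uint32_t i, uint64_t chunk) {
  return "__cf__:" + filter_key + ":" + std::to_string(i) + ":" + std::to_string(chunk);
}
```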
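And to show the per-operation flow I described earlier, a simplified in-memory sketch of the QUERY path. The `SubFilter` and `Contains` names are again illustrative; real code would read the bucket bytes from RocksDB rather than a vector:

```cpp
#include <cstdint>
#include <vector>

// One sub-filter's bucket array, held in memory for illustration.
struct SubFilter {
  std::vector<uint8_t> slots;  // n_buckets * bucket_size fingerprint bytes
  uint64_t n_buckets;
  uint32_t bucket_size;

  // O(1) to locate the bucket, then an O(n) scan over its slots, where
  // n = bucket_size is at most 255, so the scan stays cheap.
  bool BucketContains(uint64_t bucket_index, uint8_t fp) const {
    const uint8_t *bucket = slots.data() + bucket_index * bucket_size;
    for (uint32_t s = 0; s < bucket_size; ++s) {
      if (bucket[s] == fp) return true;
    }
    return false;
  }
};

// QUERY: iterate the sub-filters starting from the latest; `i1` and `i2`
// are the item's two candidate bucket indexes, reduced modulo each
// sub-filter's own bucket count. DELETE walks the same way and removes
// the first matching fingerprint; ADD tries only the newest sub-filter
// and triggers a resize (a new sub-filter) when its kicks are exhausted.
bool Contains(const std::vector<SubFilter> &filters, uint64_t i1, uint64_t i2, uint8_t fp) {
  for (auto it = filters.rbegin(); it != filters.rend(); ++it) {
    if (it->BucketContains(i1 % it->n_buckets, fp) ||
        it->BucketContains(i2 % it->n_buckets, fp)) {
      return true;
    }
  }
  return false;
}
```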
GitHub link: https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13933205

----
This is an automatically sent email for issues@kvrocks.apache.org.
To unsubscribe, please send an email to: issues-unsubscr...@kvrocks.apache.org