GitHub user LiuQhahah added a comment to the discussion: Cuckoo Filter Design Proposal
> For example, if initially we have 100 buckets in one rocksdb key, and the rehashing is triggered, we should then have 300 buckets. And in this case, we have multiple choices:
>
> - remain with just one rocksdb key, and put all 300 buckets into this key
> - split them into two rocksdb keys, i.e. 150 buckets in each key

I was influenced by RedisBloom, which uses multiple sub-filters to form a cuckoo filter. Personally, I find this approach quite clear. For ADD operations, if multiple attempts to insert an item fail, a resize is triggered, and the item is inserted into a new sub-filter. For queries, you simply iterate through the sub-filters one by one, starting from the latest.

A cuckoo filter finds the bucket index in O(1); checking whether that bucket has a free slot then takes O(n), where n is the bucket size. Since the bucket size ranges from 1 to 255, I believe this O(n) cost is acceptable, so the ADD operation remains relatively fast. The same logic applies to DELETE and QUERY operations.

Below is a diagram of the CF layout in RocksDB after three expansions, with a base capacity of 16, expansion=2, and bucket_size=2.

```
RocksDB Storage
===============

Metadata Entry
--------------
Key:   "__namespace__:cuckoo_filter:my_cf"
Value: { "n_filters": 4, "base_capacity": 16, "expansion": 2, ... }

Data Shard #0
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:0"
Value: (Binary data, 32 bytes for 16 buckets)

  +-----------+-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | Bucket 2  | ... | Bucket 15 |
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+
  | S0  | S1  | S0  | S1  | S0  | S1  | ... | S0  | S1  |  (S = Slot)
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+
  | fA  | fB  | fC  | fD  | fE  | fF  | ... | fX  | fY  |  (f = Fingerprint, 1 byte)
  +-----+-----+-----+-----+-----+-----+ ... +-----+-----+

Data Shard #1
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:1"
Value: (Binary data, 64 bytes for 32 buckets)

  +-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | ... | Bucket 31 |
  +-----+-----+-----+-----+ ... +-----+-----+
  | S0  | S1  | S0  | S1  | ... | S0  | S1  |
  +-----+-----+-----+-----+ ... +-----+-----+

Data Shard #2
-------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:2"
Value: (Binary data, 128 bytes for 64 buckets)

  ... (similar structure, but with 64 buckets) ...

Data Shard #3 (Empty)
---------------------
Key:   "__cf__:__namespace__:cuckoo_filter:my_cf:3"
Value: (Binary data, 256 bytes for 128 buckets, all bytes are 0x00)

  +-----------+-----------+ ... +-----------+
  | Bucket 0  | Bucket 1  | ... | Bucket 127|
  +-----+-----+-----+-----+ ... +-----+-----+
  | 00  | 00  | 00  | 00  | ... | 00  | 00  |  (All slots are empty/zero)
  +-----+-----+-----+-----+ ... +-----+-----+
```

In addition, I have a question I'd like to consult about. The RocksDB FAQ (https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ) recommends a maximum value size of 3GB. Under the current idea, after N expansions of the cuckoo filter, a chunking method needs to be adopted to split a single sub-filter into multiple key-value pairs of at most 3GB each. Is it necessary to adopt the 3GB threshold from the start? That is: while a CF sub-filter's capacity is less than 3GB, the sub-filter is stored as a single key-value pair; once a sub-filter's capacity exceeds 3GB after an expansion, it is represented by multiple 3GB key-value pairs.
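To make the chunking idea concrete, here is a minimal C++ sketch of how the sizes and keys could be computed. The `:<chunk>` key suffix, the struct and function names, and the 3GB constant are illustrative assumptions for this discussion, not part of the proposal above:

```cpp
#include <cstdint>
#include <string>

// Recommended maximum value size from the RocksDB FAQ.
constexpr uint64_t kMaxChunkBytes = 3ULL * 1024 * 1024 * 1024;

// Mirrors the metadata entry in the diagram above.
struct CuckooFilterMeta {
  uint32_t n_filters;      // number of sub-filters created so far
  uint64_t base_capacity;  // buckets in sub-filter #0
  uint32_t expansion;      // capacity growth factor per resize
  uint32_t bucket_size;    // slots per bucket (1-byte fingerprints)
};

// Bytes occupied by sub-filter `i`: the bucket count grows by `expansion`
// with every resize, and each slot holds one fingerprint byte.
uint64_t SubFilterBytes(const CuckooFilterMeta &meta, uint32_t i) {
  uint64_t buckets = meta.base_capacity;
  for (uint32_t j = 0; j < i; ++j) buckets *= meta.expansion;
  return buckets * meta.bucket_size;
}

// Number of <=3GB chunks needed for sub-filter `i`. This is 1 while the
// sub-filter still fits in a single value, so small filters keep the
// one-key-per-sub-filter form from the diagram.
uint64_t ChunkCount(const CuckooFilterMeta &meta, uint32_t i) {
  const uint64_t bytes = SubFilterBytes(meta, i);
  return (bytes + kMaxChunkBytes - 1) / kMaxChunkBytes;  // ceiling division
}

// Hypothetical chunked key: the diagram uses ":<sub_filter>" only; a
// ":<chunk>" suffix would be appended once a sub-filter is split, e.g.
// "__cf__:__namespace__:cuckoo_filter:my_cf:3:0".
std::string ChunkKey(const std::string &filter_key, uint32_t i, uint64_t chunk) {
  return "__cf__:" + filter_key + ":" + std::to_string(i) + ":" + std::to_string(chunk);
}
```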
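And to show the per-operation flow I described earlier, a simplified in-memory sketch of the QUERY path. The `SubFilter` and `Contains` names are again illustrative; real code would read the bucket bytes from RocksDB rather than a vector:

```cpp
#include <cstdint>
#include <vector>

// One sub-filter's bucket array, held in memory for illustration.
struct SubFilter {
  std::vector<uint8_t> slots;  // n_buckets * bucket_size fingerprint bytes
  uint64_t n_buckets;
  uint32_t bucket_size;

  // O(1) to locate the bucket, then an O(n) scan over its slots, where
  // n = bucket_size is at most 255, so the scan stays cheap.
  bool BucketContains(uint64_t bucket_index, uint8_t fp) const {
    const uint8_t *bucket = slots.data() + bucket_index * bucket_size;
    for (uint32_t s = 0; s < bucket_size; ++s) {
      if (bucket[s] == fp) return true;
    }
    return false;
  }
};

// QUERY: iterate the sub-filters starting from the latest; `i1` and `i2`
// are the item's two candidate bucket indexes, reduced modulo each
// sub-filter's own bucket count. DELETE walks the same way and removes
// the first matching fingerprint; ADD tries only the newest sub-filter
// and triggers a resize (a new sub-filter) when its kicks are exhausted.
bool Contains(const std::vector<SubFilter> &filters, uint64_t i1, uint64_t i2, uint8_t fp) {
  for (auto it = filters.rbegin(); it != filters.rend(); ++it) {
    if (it->BucketContains(i1 % it->n_buckets, fp) ||
        it->BucketContains(i2 % it->n_buckets, fp)) {
      return true;
    }
  }
  return false;
}
```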
GitHub link: https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13933205

----
This is an automatically sent email for issues@kvrocks.apache.org.
To unsubscribe, please send an email to: issues-unsubscr...@kvrocks.apache.org