GitHub user jonathanc-n edited a discussion: Count-Min Sketch Support

References: #2425 

### Overview
Count-Min Sketch is a probabilistic data structure used to estimate the 
frequency of elements in a data stream. It is particularly useful in scenarios 
where memory efficiency is important, such as network traffic analysis, 
real-time analytics, etc.

The core idea behind Count-Min Sketch is to use multiple hash functions to map 
each element in a data stream to different positions in a set of arrays. Each 
bucket in these arrays holds a count that represents the estimated frequency of 
the elements mapped to it.

Hashing: When an element appears in the data stream, it is passed through 
multiple hash functions. Each hash function produces an index corresponding to 
a position in a different array. The count at each of these positions is then 
incremented.

Estimating Frequency: To estimate the frequency of an element, the element is 
passed through the same hash functions, and the counts at the corresponding 
positions in the arrays are retrieved. The minimum value among these counts is 
taken as the estimate for the frequency of that element. This approach helps 
mitigate overestimation caused by hash collisions.

### Structure of Count-Min Sketch
Array of Counters: The Count-Min Sketch maintains a 2-D array, where each row 
corresponds to a different hash function, and each column represents a possible 
index generated by these hash functions. The size of the array is determined by 
the desired accuracy and confidence level. The more rows and columns, the more 
accurate the results are at the cost of increased memory usage.

Updating the Sketch: When a new element is encountered, it is hashed multiple 
times, each hash directing the element to a different row of counters. The 
counts at the corresponding columns are incremented.

Querying: To estimate the frequency of an element, the same hash functions are 
applied, and the counts from the respective positions are retrieved. 

### Supported Commands on Redis
CMS.INCRBY - Increases the count of one or more items by increment.
CMS.INFO - Returns information about a sketch.
CMS.INITBYDIM - Initializes a Count-Min Sketch to dimensions specified by the 
user.
CMS.INITBYPROB - Initializes a Count-Min Sketch to accommodate requested 
tolerances.
CMS.MERGE - Merges several sketches into one sketch.
CMS.QUERY - Returns the count for one or more items in a sketch.

### Implementation
I was looking into Redis' count min sketch implementation and noticed these 
things:

Hash Function Choice: Redis uses a modified version of the MurmurHash2 function 
to generate hash values for the items. 

Storage Format
CMS Array: The CMS data structure is stored as a 2D array (flattened into a 1D 
array) of uint32_t integers. Each element in this array represents the count 
for a particular combination of hash function and item. The size of the array 
is determined by the width (number of columns) and depth (number of rows) of 
the CMS. 
- Storing the entire array with a single serialized key, with possibility of 
subkeys if parts of the array are being 

String Representation: In Redis, data structures like strings and bitmaps are 
generally treated as "string" types. However, for CMS, the structure is 
specifically tailored to handle frequency counts. The memory is made to be 
manipulated easily. Counts will be stored as uin32_t in binary format making it 
super compact.

Metadata: Metadata like width, depth, and counter can be stored at the 
beginning of the serialized value if using a single key-value pair.

Tuning
Redis' CMS implementation follows a fixed configuration for the depth and width 
based on error and probability parameters. However, tuning these parameters can 
significantly impact the performance and accuracy of the CMS:

Depth and Width Calculation: The width and depth of the CMS can be derived from 
the desired error rate (error) and the probability of failure (delta). The 
function CMS_DimFromProb calculates the width as ceil(2 / error) and the depth 
as ceil(log10f(delta) / log10f(0.5)). This fixed-size approach simplifies the 
usage and ensures consistent performance.

Merging CMS Structures: The function CMS_Merge allows combining multiple CMS 
instances into one. This is particularly useful in distributed systems where 
frequency data from different nodes need to be aggregated. The function takes 
care of summing the counts from multiple sources while checking for potential 
overflows to prevent data corruption.

I will be looking into how to properly integrate this with the storage layer, 
and support the commands on top of it, I will share my solution here once I 
have properly mapped it out. If anyone has any insights please feel free to 
share

Note: will be adding complexities 

Resources
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf 
https://github.com/RedisBloom/RedisBloom/blob/master/docs/docs/count-min-sketch.md
 
https://redis.io/docs/latest/develop/data-types/probabilistic/count-min-sketch/
https://redis.io/docs/latest/commands/?group=cms

ty @mapleFU for the format outline :)


 

GitHub link: https://github.com/apache/kvrocks/discussions/2496

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to