GitHub user hidedim created a discussion: TopK proposal
# Introduction In [#3176](https://github.com/apache/kvrocks/issues/3176), we plan to support the TopK algorithm in Redis. RedisTopK is a redis module that can be used find top-k elephant flows. It is based on the [TopK](https://www.usenix.org/conference/atc18/presentation/gong), which is a probabilistic data structure that can be used to find the top k elements by frequency of occurrence in a set. # Core Concepts 1. Fingerprint: A fingerprint is a value of uint32_t that is generated from a given data element by hash function. It is used to judge whether two data elements are the same in Bucket. 2. Bucket: Store the fingerprint and the count of occurrence of the data element. 3. Min-Heap: Store the k most frequency elements. # Data Structure Design ## Metadata |Feield|Type|Description| |:--:|:---|:---| |k|uint32_t|The number of top k elements| |width|uint32_t|Bucket has how many capicity. Defalut value: 7| |depth|uint32_t|number of Bucket, Defalut value: 8| |decay|double|decay factor: Defalut value: 0.9| +---------+----------+-----------+---------+---------+---------+---------+ key => | flags | expire | version | k | width | depth | decay | | 4 bytes | 4 bytes | 4 bytes | 4 bytes | 4 bytes | 4 bytes | 8 bytes | +---------+----------+-----------+---------+---------+---------+---------+ ## Bucket |Feield|Type|Description| |:--:|:---|:---| |fingerprint|uint32_t|The fingerprint of the data element| |count|uint32_t|The count of occurrence of the data element| +-------------+-----------+ key => | fingerprint | count | | 4 bytes | 4 bytes | +-------------+-----------+ ```cpp struct Bucket { uint32_t fp; counter_t count; }; ``` ## HeapBucket |Feield|Type|Description| |:--:|:---|:---| |fp|uint32_t|The fingerprint of the data element| |count|uint32_t|The count of occurrence of the data element| |itemlen|uint32_t|The length of the data element| |item|char*|The data element| +-------------+-----------+-----------+-----------+ key => | fingerprint | count | itemlen | item | | 4 bytes | 4 bytes | 4 bytes | 8 bytes | +-------------+-----------+-----------+-----------+ ```cpp struct HeapBucket { uint32_t fp; uint32_t itemlen; char* item; counter_t count; }; ``` ## TopK |Feield|Type|Description| |:--:|:---|:---| |heap_size|uint32_t|Current length of heap| |bucket|Bucket*|Store Bucket of width * depth| |heap|HeapBucket*|Store HeapBucket of k| +-----------+-----------+-----------+ key => | heap_size | bucket | heap | | 4 bytes | 8 bytes | 8 bytes | +-----------+-----------+-----------+ ```cpp size_t heap_size; Bucket *buckets; HeapBucket *heap; ``` # Operations |Operation|Description| |:--:|:---| |TOPK.RESERVE|Init the Top-K structure| |TOPK.ADD|Add an element to the Top-K structure| |TOPK.QUERY|Query element is in the Top-K structure| |TOPK.LIST|List k elements in the Top-K structure| |TOPK.INFO|Get the information of the Top-K metadata structure| ## TOPK.RESERVE Init the topk structure. ### Parameters: key: The key name of the Top-K structure. k: The number of highest-frequency elements to track. width(optional): The number of counters in each bucket, with default value of 7. depth(optional): The number of buckets in each level, with default value of 8. decay(optional): The decay probability when counters collide, with default value of 0.9. ### example: ```redis TOPK.RESERVE key k [width] [depth] [decay] ``` ### Implementation: 1. Create a new Top-K metadata structure. 2. Set the width, depth and decay to the given values or default values. 3. Create a new Top-K structure. 4. Initialize Bucket and HeapBucket by zero and nullptr, and set heap_size to 0. 5. Return OK. ## TOPK.ADD Add an element to the Top-K structure. ### Parameters: key: The key name of the Top-K structure. item: The element to be added. ### example: ```redis TOPK.ADD key item ``` ### Implementation: 1. Read the metadata and the topk structure. 2. Check if the item is existed in HeapBucket. 3. For each bucket in Topk structure: each bucket has own hash function. Compute the hash value of the item and check location of hash value. Case 1: the bucket count is 0, set fingerprint of the item and count to 1. Case 2: if the bucket fingerprint equals to the item fingerprint, and the item is in HeapBucket, the count of the bucket increase 1. Case 3: decay the count of the bucket. if count is 0, set fingerprint of the item and count to 1. 4. Update HeapBucket. For each bucket in Topk structure: record the maxcount of item. If item is in HeapBucket, update the count of item in HeapBucket. If item is not in HeapBucket, only maxcount of item is equal to mincount of HeapBucket or 1 + mincount of HeapBucket, update the top HeapBucket to item. 5. Update TopK structure. Store the Buckets and HeapBucket to Redis. ## TOPK.QUERY Return is element in the Top-K structure. ### Parameters: key: The key name of the Top-K structure. item: The element to be queried. ### example: ```redis TOPK.QUERY key item ``` ### Implementation: 1. Read the metadata and the topk structure. 2. Check if the element is in heap of topk by comparing char* item. 3. Return 1 if the element is in the topk structure, otherwise return 0. ## TOPK.LIST Return the list of elements in the Top-K structure. ### Parameters: key: The key name of the Top-K structure. ### example: ```redis TOPK.LIST key ``` ### Implementation: 1. Read the metadata and the topk structure. 2. Return the list of elements in heap of topk. ## TOPK.INFO Return the information of the Top-K structure. ### Parameters: key: The key name of the Top-K structure. ### example: ```redis TOPK.INFO key ``` ### Implementation: 1. Read the metadata and parse it. 2. Return the metadata. ## TOPK.INCRBY Increment the count of an element in the Top-K structure. ### Parameters: key: The key name of the Top-K structure. item: The element to be incremented. count: The amount to increment the count by. ### example: ```redis TOPK.INCRBY key item count ``` ### Implementation: similar to TOPK.ADD, but instead of adding one count of element to the topk structure, it increments the count of the element by your parameter. # Reference https://redis.io/docs/latest/develop/data-types/probabilistic/top-k/ https://redis.io/docs/latest/commands/?group=topk https://www.usenix.org/conference/atc18/presentation/gong GitHub link: https://github.com/apache/kvrocks/discussions/3259 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
