GitHub user hidedim created a discussion: TopK proposal

# Introduction
In [#3176](https://github.com/apache/kvrocks/issues/3176), we plan to support 
the TopK algorithm in Redis.

RedisTopK is a redis module that can be used find top-k elephant flows. It is 
based on the [TopK](https://www.usenix.org/conference/atc18/presentation/gong), 
which is a probabilistic data structure that can be used to find the top k 
elements by frequency of occurrence in a set.

# Core Concepts
1. Fingerprint: A fingerprint is a value of uint32_t that is generated from a 
given data element by hash function. It is used to judge whether two data 
elements are the same in Bucket.
2. Bucket: Store the fingerprint and the count of occurrence of the data 
element.
3. Min-Heap: Store the k most frequency elements.


# Data Structure Design

## Metadata
|Feield|Type|Description|
|:--:|:---|:---|
|k|uint32_t|The number of top k elements|
|width|uint32_t|Bucket has how many capicity. Defalut value: 7|
|depth|uint32_t|number of Bucket, Defalut value: 8|
|decay|double|decay factor: Defalut value: 0.9|


        
+---------+----------+-----------+---------+---------+---------+---------+
key =>  |  flags  |  expire  |  version  |    k    |  width  |  depth  |  decay 
 |
        | 4 bytes | 4 bytes  |  4 bytes  | 4 bytes | 4 bytes | 4 bytes | 8 
bytes |
        
+---------+----------+-----------+---------+---------+---------+---------+


## Bucket
|Feield|Type|Description|
|:--:|:---|:---|
|fingerprint|uint32_t|The fingerprint of the data element|
|count|uint32_t|The count of occurrence of the data element|

        +-------------+-----------+
key =>  | fingerprint |   count   |
        |   4 bytes   |  4 bytes  |
        +-------------+-----------+
```cpp
struct Bucket {
    uint32_t fp;
    counter_t count;
};
```

## HeapBucket
|Feield|Type|Description|
|:--:|:---|:---|
|fp|uint32_t|The fingerprint of the data element|
|count|uint32_t|The count of occurrence of the data element|
|itemlen|uint32_t|The length of the data element|
|item|char*|The data element|

        +-------------+-----------+-----------+-----------+
key =>  | fingerprint |   count   |  itemlen  |    item   |
        |   4 bytes   |  4 bytes  |  4 bytes  |  8 bytes  |
        +-------------+-----------+-----------+-----------+

```cpp
struct HeapBucket {
    uint32_t fp;
    uint32_t itemlen;
    char* item;
    counter_t count;
};
```

## TopK
|Feield|Type|Description|
|:--:|:---|:---|
|heap_size|uint32_t|Current length of heap|
|bucket|Bucket*|Store Bucket of width * depth|
|heap|HeapBucket*|Store HeapBucket of k|

        +-----------+-----------+-----------+
key =>  | heap_size |   bucket  |    heap   |
        |  4 bytes  |  8 bytes  |  8 bytes  |
        +-----------+-----------+-----------+
```cpp
size_t heap_size;
Bucket *buckets;
HeapBucket *heap;
```

# Operations
|Operation|Description|
|:--:|:---|
|TOPK.RESERVE|Init the Top-K structure|
|TOPK.ADD|Add an element to the Top-K structure|
|TOPK.QUERY|Query element is in the Top-K structure|
|TOPK.LIST|List k elements in the Top-K structure|
|TOPK.INFO|Get the information of the Top-K metadata structure|


## TOPK.RESERVE
Init the topk structure.

### Parameters:
key: The key name of the Top-K structure.
k: The number of highest-frequency elements to track.
width(optional): The number of counters in each bucket, with default value of 7.
depth(optional): The number of buckets in each level, with default value of 8.
decay(optional): The decay probability when counters collide, with default 
value of 0.9.

### example:
```redis
TOPK.RESERVE key k [width] [depth] [decay]
```

### Implementation:
1. Create a new Top-K metadata structure.
2. Set the width, depth and decay to the given values or default values.
3. Create a new Top-K structure.
4. Initialize Bucket and HeapBucket by zero and nullptr, and set heap_size to 0.
5. Return OK.


## TOPK.ADD
Add an element to the Top-K structure.

### Parameters:
key: The key name of the Top-K structure.
item: The element to be added.

### example:
```redis
TOPK.ADD key item
```

### Implementation:
1. Read the metadata and the topk structure.
2. Check if the item is existed in HeapBucket.
3. For each bucket in Topk structure: each bucket has own hash function. 
Compute the hash value of the item and check location of hash value. 
Case 1: the bucket count is 0, set fingerprint of the item and count to 1.
Case 2: if the bucket fingerprint equals to the item fingerprint, and the item 
is in HeapBucket, the count of the bucket increase 1.
Case 3: decay the count of the bucket. if count is 0, set fingerprint of the 
item and count to 1.
4. Update HeapBucket. For each bucket in Topk structure: record the maxcount of 
item. If item is in HeapBucket, update the count of item in HeapBucket. If item 
is not in HeapBucket, only maxcount of item is equal to mincount of HeapBucket 
or 1 + mincount of HeapBucket, update the top HeapBucket to item.
5. Update TopK structure. Store the Buckets and HeapBucket to Redis.


## TOPK.QUERY
Return is element in the Top-K structure.

### Parameters:
key: The key name of the Top-K structure.
item: The element to be queried.

### example:
```redis
TOPK.QUERY key item
```

### Implementation:
1. Read the metadata and the topk structure.
2. Check if the element is in heap of topk by comparing char* item.
3. Return 1 if the element is in the topk structure, otherwise return 0.


## TOPK.LIST
Return the list of elements in the Top-K structure.

### Parameters:
key: The key name of the Top-K structure.

### example:
```redis
TOPK.LIST key
```

### Implementation:
1. Read the metadata and the topk structure.
2. Return the list of elements in heap of topk.


## TOPK.INFO
Return the information of the Top-K structure.

### Parameters:
key: The key name of the Top-K structure.

### example:
```redis
TOPK.INFO key
```

### Implementation:
1. Read the metadata and parse it.
2. Return the metadata.

## TOPK.INCRBY
Increment the count of an element in the Top-K structure.

### Parameters:
key: The key name of the Top-K structure.
item: The element to be incremented.
count: The amount to increment the count by.

### example:
```redis
TOPK.INCRBY key item count
```

### Implementation:
similar to TOPK.ADD, but instead of adding one count of element to the topk 
structure, it increments the count of the element by your parameter.

# Reference
https://redis.io/docs/latest/develop/data-types/probabilistic/top-k/
https://redis.io/docs/latest/commands/?group=topk
https://www.usenix.org/conference/atc18/presentation/gong

GitHub link: https://github.com/apache/kvrocks/discussions/3259

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to