This is an automated email from the ASF dual-hosted git repository.
critas pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/iotdb-docs.git
The following commit(s) were added to refs/heads/main by this push:
new daeb8fc4 add function: approx_most_frequent (#800)
daeb8fc4 is described below
commit daeb8fc436965690f4e55ad905f46efe30bdecac
Author: leto-b <[email protected]>
AuthorDate: Thu Dec 11 19:59:52 2025 +0800
add function: approx_most_frequent (#800)
* add function: approx_most_frequent
* add version
---
.../Master/Table/SQL-Manual/Basis-Function.md | 79 ++++++++++++++--------
.../latest-Table/SQL-Manual/Basis-Function.md | 33 +++++++--
.../Master/Table/SQL-Manual/Basis-Function.md | 36 ++++++++--
.../latest-Table/SQL-Manual/Basis-Function.md | 35 ++++++++--
4 files changed, 134 insertions(+), 49 deletions(-)
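
For readers skimming the diff, the behaviour the new rows describe (a `capacity`-bounded approximate top-k that returns a JSON string) can be illustrated with a small Python sketch. It uses a Space-Saving style counter, which is an assumption chosen for illustration; the commit does not say which algorithm IoTDB uses. The function name `approx_top_k` and the sample `readings` are hypothetical.

```python
import json
from typing import Dict, Hashable, Iterable


def approx_top_k(values: Iterable[Hashable], k: int, capacity: int) -> str:
    """Space-Saving style approximate top-k (illustrative, not IoTDB's code)."""
    if k <= 0 or capacity <= 0:
        raise ValueError("k and capacity must be positive")
    counts: Dict[Hashable, int] = {}
    for value in values:
        if value in counts:
            counts[value] += 1
        elif len(counts) < capacity:
            counts[value] = 1
        else:
            # Every bucket is taken: evict the smallest counter and let the
            # newcomer inherit its count plus one. This is the source of error.
            victim = min(counts, key=counts.get)
            counts[value] = counts.pop(victim) + 1
    # Keep the k largest counters and serialize them compactly as JSON.
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return json.dumps({str(v): c for v, c in top}, separators=(",", ":"))


# Hypothetical sample data, not the contents of table1.
readings = [85.0] * 6 + [90.0] * 5 + [88.0] * 2 + [86.0]
print(approx_top_k(readings, k=2, capacity=100))  # {"85.0":6,"90.0":5}
```

With `capacity=2` the same input returns `{"88.0":7,"86.0":7}`: once every bucket is taken, newcomers inherit evicted counts, so both the reported values and their counts can be wrong. This matches the trade-off the documentation gives for `capacity`, where a larger value lowers the error at the cost of memory.
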
diff --git a/src/UserGuide/Master/Table/SQL-Manual/Basis-Function.md b/src/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
index b50e10dd..d8c96720 100644
--- a/src/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
+++ b/src/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
@@ -156,29 +156,30 @@ SELECT LEAST(temperature,humidity) FROM table2;
### 2.2 Supported Aggregate Functions
-| Function Name | Description
[...]
-|:-----------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| COUNT | Counts the number of data points.
[...]
-| COUNT_IF | COUNT_IF(exp) counts the number of rows that
satisfy a specified boolean expression.
[...]
-| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[, maxStandardError])
function provides an approximation of COUNT(DISTINCT x), returning the
estimated number of distinct input values.
[...]
-| SUM | Calculates the sum.
[...]
-| AVG | Calculates the average.
[...]
-| MAX | Finds the maximum value.
[...]
-| MIN | Finds the minimum value.
[...]
-| FIRST | Finds the value with the smallest timestamp that is
not NULL.
[...]
-| LAST | Finds the value with the largest timestamp that is
not NULL.
[...]
-| STDDEV | Alias for STDDEV_SAMP, calculates the sample
standard deviation.
[...]
-| STDDEV_POP | Calculates the population standard deviation.
[...]
-| STDDEV_SAMP | Calculates the sample standard deviation.
[...]
-| VARIANCE | Alias for VAR_SAMP, calculates the sample
variance.
[...]
-| VAR_POP | Calculates the population variance.
[...]
-| VAR_SAMP | Calculates the sample variance.
[...]
-| EXTREME | Finds the value with the largest absolute value. If
the largest absolute values of positive and negative values are equal, returns
the positive value.
[...]
-| MODE | Finds the mode. Note: 1. There is a risk of memory
exception when the number of distinct values in the input sequence is too
large; 2. If all elements have the same frequency, i.e., there is no mode, a
random element is returned; 3. If there are multiple modes, a random mode is
returned; 4. NULL values are also counted in frequency, so even if not all
values in the input sequence are NULL, the final result may still be NULL.
[...]
-| MAX_BY | MAX_BY(x, y) finds the value of x corresponding to
the maximum y in the binary input x and y. MAX_BY(time, x) returns the
timestamp when x is at its maximum.
[...]
-| MIN_BY | MIN_BY(x, y) finds the value of x corresponding to
the minimum y in the binary input x and y. MIN_BY(time, x) returns the
timestamp when x is at its minimum.
[...]
-| FIRST_BY | FIRST_BY(x, y) finds the value of x in the same row
when y is the first non-null value.
[...]
-| LAST_BY | LAST_BY(x, y) finds the value of x in the same row
when y is the last non-null value.
[...]
+| Function Name | Description
| Allowed Input Types
[...]
+|:-----------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------
[...]
+| COUNT | Counts the number of data points.
| All types
[...]
+| COUNT_IF | COUNT_IF(exp) counts the number of rows that
satisfy a specified boolean expression.
| `exp` must be
a boolean expression [...]
+| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[, maxStandardError])
function provides an approximation of COUNT(DISTINCT x), returning the
estimated number of distinct input values.
| `x`: The
target column to be calcu [...]
+| APPROX_MOST_FREQUENT | The APPROX_MOST_FREQUENT(x, k, capacity) function is
used to approximately calculate the top k most frequent elements in a dataset.
It returns a JSON-formatted string where the keys are the element values and
the values are their corresponding approximate frequencies. (Available since
V2.0.5.1)
| `x` : The
column to be calculated, s [...]
+| SUM | Calculates the sum.
| INT32 INT64 FLOAT
DOUBLE [...]
+| AVG | Calculates the average.
| INT32 INT64 FLOAT
DOUBLE [...]
+| MAX | Finds the maximum value.
| All types
[...]
+| MIN | Finds the minimum value.
| All types
[...]
+| FIRST | Finds the value with the smallest timestamp that is
not NULL.
| All types
[...]
+| LAST | Finds the value with the largest timestamp that is
not NULL.
| All types
[...]
+| STDDEV | Alias for STDDEV_SAMP, calculates the sample
standard deviation.
| INT32 INT64
FLOAT DOUBLE [...]
+| STDDEV_POP | Calculates the population standard deviation.
| INT32 INT64 FLOAT
DOUBLE [...]
+| STDDEV_SAMP | Calculates the sample standard deviation.
| INT32 INT64 FLOAT
DOUBLE [...]
+| VARIANCE | Alias for VAR_SAMP, calculates the sample
variance.
| INT32 INT64
FLOAT DOUBLE [...]
+| VAR_POP | Calculates the population variance.
| INT32 INT64 FLOAT
DOUBLE [...]
+| VAR_SAMP | Calculates the sample variance.
| INT32 INT64 FLOAT
DOUBLE [...]
+| EXTREME | Finds the value with the largest absolute value. If
the largest absolute values of positive and negative values are equal, returns
the positive value.
| INT32 INT64 FLOAT
DOUBLE [...]
+| MODE | Finds the mode. Note: 1. There is a risk of memory
exception when the number of distinct values in the input sequence is too
large; 2. If all elements have the same frequency, i.e., there is no mode, a
random element is returned; 3. If there are multiple modes, a random mode is
returned; 4. NULL values are also counted in frequency, so even if not all
values in the input sequence are NULL, the final result may still be NULL. |
All types [...]
+| MAX_BY | MAX_BY(x, y) finds the value of x corresponding to
the maximum y in the binary input x and y. MAX_BY(time, x) returns the
timestamp when x is at its maximum.
| x and y
can be of any type [...]
+| MIN_BY | MIN_BY(x, y) finds the value of x corresponding to
the minimum y in the binary input x and y. MIN_BY(time, x) returns the
timestamp when x is at its minimum.
| x and y
can be of any type [...]
+| FIRST_BY | FIRST_BY(x, y) finds the value of x in the same row
when y is the first non-null value.
| x and y can be of
any type [...]
+| LAST_BY | LAST_BY(x, y) finds the value of x in the same row
when y is the last non-null value.
| x and y can be of
any type [...]
### 2.3 Examples
@@ -251,8 +252,28 @@ Total line number = 1
It costs 0.022s
```
+#### 2.3.5 Approx_most_frequent
-#### 2.3.5 First
+Query the top 2 most frequent values in the `temperature` column of `table1`.
+
+```sql
+IoTDB> select approx_most_frequent(temperature,2,100) as topk from table1;
+```
+
+The execution result is as follows:
+
+```sql
++-------------------+
+| topk|
++-------------------+
+|{"85.0":6,"90.0":5}|
++-------------------+
+Total line number = 1
+It costs 0.064s
+```
+
+
+#### 2.3.6 First
Finds the values with the smallest timestamp that are not NULL in the
`temperature` and `humidity` columns.
@@ -272,7 +293,7 @@ Total line number = 1
It costs 0.170s
```
-#### 2.3.6 Last
+#### 2.3.7 Last
Finds the values with the largest timestamp that are not NULL in the
`temperature` and `humidity` columns.
@@ -292,7 +313,7 @@ Total line number = 1
It costs 0.211s
```
-#### 2.3.7 First_by
+#### 2.3.8 First_by
Finds the `time` value of the row with the smallest timestamp that is not NULL
in the `temperature` column, and the `humidity` value of the row with the
smallest timestamp that is not NULL in the `temperature` column.
@@ -312,7 +333,7 @@ Total line number = 1
It costs 0.269s
```
-#### 2.3.8 Last_by
+#### 2.3.9 Last_by
Queries the `time` value of the row with the largest timestamp that is not
NULL in the `temperature` column, and the `humidity` value of the row with the
largest timestamp that is not NULL in the `temperature` column.
@@ -332,7 +353,7 @@ Total line number = 1
It costs 0.070s
```
-#### 2.3.9 Max_by
+#### 2.3.10 Max_by
Queries the `time` value of the row where the `temperature` column is at its
maximum, and the `humidity` value of the row where the `temperature` column is
at its maximum.
@@ -352,7 +373,7 @@ Total line number = 1
It costs 0.172s
```
-#### 2.3.10 Min_by
+#### 2.3.11 Min_by
Queries the `time` value of the row where the `temperature` column is at its
minimum, and the `humidity` value of the row where the `temperature` column is
at its minimum.
diff --git a/src/UserGuide/latest-Table/SQL-Manual/Basis-Function.md b/src/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
index b50e10dd..65ba2014 100644
--- a/src/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
+++ b/src/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
@@ -161,6 +161,7 @@ SELECT LEAST(temperature,humidity) FROM table2;
| COUNT | Counts the number of data points.
[...]
| COUNT_IF | COUNT_IF(exp) counts the number of rows that
satisfy a specified boolean expression.
[...]
| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[, maxStandardError])
function provides an approximation of COUNT(DISTINCT x), returning the
estimated number of distinct input values.
[...]
+| APPROX_MOST_FREQUENT | The APPROX_MOST_FREQUENT(x, k, capacity) function is
used to approximately calculate the top k most frequent elements in a dataset.
It returns a JSON-formatted string where the keys are the element values and
the values are their corresponding approximate frequencies. (Available since
V2.0.5.1) | `x` : The column to be calculated, supporting all existing data
types in IoTDB;<br> `k`: The number of top-k most frequent values to
return;<br>`capacity`: The number of [...]
| SUM | Calculates the sum.
[...]
| AVG | Calculates the average.
[...]
| MAX | Finds the maximum value.
[...]
@@ -251,8 +252,28 @@ Total line number = 1
It costs 0.022s
```
+#### 2.3.5 Approx_most_frequent
-#### 2.3.5 First
+Query the top 2 most frequent values in the `temperature` column of `table1`.
+
+```sql
+IoTDB> select approx_most_frequent(temperature,2,100) as topk from table1;
+```
+
+The execution result is as follows:
+
+```sql
++-------------------+
+| topk|
++-------------------+
+|{"85.0":6,"90.0":5}|
++-------------------+
+Total line number = 1
+It costs 0.064s
+```
+
+
+#### 2.3.6 First
Finds the values with the smallest timestamp that are not NULL in the
`temperature` and `humidity` columns.
@@ -272,7 +293,7 @@ Total line number = 1
It costs 0.170s
```
-#### 2.3.6 Last
+#### 2.3.7 Last
Finds the values with the largest timestamp that are not NULL in the
`temperature` and `humidity` columns.
@@ -292,7 +313,7 @@ Total line number = 1
It costs 0.211s
```
-#### 2.3.7 First_by
+#### 2.3.8 First_by
Finds the `time` value of the row with the smallest timestamp that is not NULL
in the `temperature` column, and the `humidity` value of the row with the
smallest timestamp that is not NULL in the `temperature` column.
@@ -312,7 +333,7 @@ Total line number = 1
It costs 0.269s
```
-#### 2.3.8 Last_by
+#### 2.3.9 Last_by
Queries the `time` value of the row with the largest timestamp that is not
NULL in the `temperature` column, and the `humidity` value of the row with the
largest timestamp that is not NULL in the `temperature` column.
@@ -332,7 +353,7 @@ Total line number = 1
It costs 0.070s
```
-#### 2.3.9 Max_by
+#### 2.3.10 Max_by
Queries the `time` value of the row where the `temperature` column is at its
maximum, and the `humidity` value of the row where the `temperature` column is
at its maximum.
@@ -352,7 +373,7 @@ Total line number = 1
It costs 0.172s
```
-#### 2.3.10 Min_by
+#### 2.3.11 Min_by
Queries the `time` value of the row where the `temperature` column is at its
minimum, and the `humidity` value of the row where the `temperature` column is
at its minimum.
diff --git a/src/zh/UserGuide/Master/Table/SQL-Manual/Basis-Function.md b/src/zh/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
index c55ba021..6ed554c8 100644
--- a/src/zh/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
+++ b/src/zh/UserGuide/Master/Table/SQL-Manual/Basis-Function.md
@@ -159,7 +159,8 @@ SELECT LEAST(temperature,humidity) FROM table2;
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|------------------|
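| COUNT | Counts the number of data points. | All types | INT64 |
| COUNT_IF | COUNT_IF(exp) counts the number of rows that satisfy a specified boolean expression. | exp must be a boolean expression, e.g. count_if(temperature>20) | INT64 |
-| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[,maxStandardError]) function provides an approximation of COUNT(DISTINCT x), returning the approximate number of distinct input values. | x: the column to be calculated, supporting all types;<br> maxStandardError: the maximum standard error the function should produce, in the range [0.0040625, 0.26]; defaults to 0.023 when not specified. | INT64 |
+| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[,maxStandardError]) function provides an approximation of COUNT(DISTINCT x), returning the approximate number of distinct input values. | `x`: the column to be calculated, supporting all types;<br> `maxStandardError`: the maximum standard error the function should produce, in the range [0.0040625, 0.26]; defaults to 0.023 when not specified. | INT64 |
+| APPROX_MOST_FREQUENT | The APPROX_MOST_FREQUENT(x, k, capacity) function approximately calculates the top k most frequent elements in a dataset. It returns a JSON-formatted string where the keys are the element values and the values are their corresponding approximate frequencies. (Supported in V2.0.5.1 and later) | `x`: the column to be calculated, supporting all existing IoTDB data types;<br> `k`: the number of most frequent values to return;<br> `capacity`: the number of buckets used for the calculation, which relates to memory usage: a larger value gives a smaller error but uses more memory, while a smaller value gives a larger error but uses less memory. | STRING |
| SUM | Calculates the sum. | INT32 INT64 FLOAT DOUBLE | DOUBLE |
| AVG | Calculates the average. | INT32 INT64 FLOAT DOUBLE | DOUBLE |
| MAX | Finds the maximum value. | All types | Same as the input type |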
@@ -251,7 +252,28 @@ It costs 0.022s
```
-#### 2.3.5 First
+#### 2.3.5 Approx_most_frequent
+
+Query the 2 most frequent values in the `temperature` column of `table1`.
+
+```sql
+IoTDB> select approx_most_frequent(temperature,2,100) as topk from table1;
+```
+
+The execution result is as follows:
+
+```sql
++-------------------+
+| topk|
++-------------------+
+|{"85.0":6,"90.0":5}|
++-------------------+
+Total line number = 1
+It costs 0.064s
+```
+
+
+#### 2.3.6 First
Finds the values with the smallest timestamp that are not NULL in the `temperature` and `humidity` columns.
@@ -271,7 +293,7 @@ Total line number = 1
It costs 0.170s
```
-#### 2.3.6 Last
+#### 2.3.7 Last
Finds the values with the largest timestamp that are not NULL in the `temperature` and `humidity` columns.
@@ -291,7 +313,7 @@ Total line number = 1
It costs 0.211s
```
-#### 2.3.7 First_by
+#### 2.3.8 First_by
Queries the `time` value of the row with the smallest timestamp where `temperature` is not NULL, and the `humidity` value of the row with the smallest timestamp where `temperature` is not NULL.
@@ -311,7 +333,7 @@ Total line number = 1
It costs 0.269s
```
-#### 2.3.8 Last_by
+#### 2.3.9 Last_by
Queries the `time` value of the row with the largest timestamp where `temperature` is not NULL, and the `humidity` value of the row with the largest timestamp where `temperature` is not NULL.
@@ -331,7 +353,7 @@ Total line number = 1
It costs 0.070s
```
-#### 2.3.9 Max_by
+#### 2.3.10 Max_by
Queries the `time` value of the row where the `temperature` column is at its maximum, and the `humidity` value of the row where the `temperature` column is at its maximum.
@@ -351,7 +373,7 @@ Total line number = 1
It costs 0.172s
```
-#### 2.3.10 Min_by
+#### 2.3.11 Min_by
Queries the `time` value of the row where the `temperature` column is at its minimum, and the `humidity` value of the row where the `temperature` column is at its minimum.
diff --git a/src/zh/UserGuide/latest-Table/SQL-Manual/Basis-Function.md b/src/zh/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
index c55ba021..219d6820 100644
--- a/src/zh/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
+++ b/src/zh/UserGuide/latest-Table/SQL-Manual/Basis-Function.md
@@ -159,7 +159,8 @@ SELECT LEAST(temperature,humidity) FROM table2;
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|------------------|
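| COUNT | Counts the number of data points. | All types | INT64 |
| COUNT_IF | COUNT_IF(exp) counts the number of rows that satisfy a specified boolean expression. | exp must be a boolean expression, e.g. count_if(temperature>20) | INT64 |
-| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[,maxStandardError]) function provides an approximation of COUNT(DISTINCT x), returning the approximate number of distinct input values. | x: the column to be calculated, supporting all types;<br> maxStandardError: the maximum standard error the function should produce, in the range [0.0040625, 0.26]; defaults to 0.023 when not specified. | INT64 |
+| APPROX_COUNT_DISTINCT | The APPROX_COUNT_DISTINCT(x[,maxStandardError]) function provides an approximation of COUNT(DISTINCT x), returning the approximate number of distinct input values. | `x`: the column to be calculated, supporting all types;<br> `maxStandardError`: the maximum standard error the function should produce, in the range [0.0040625, 0.26]; defaults to 0.023 when not specified. | INT64 |
+| APPROX_MOST_FREQUENT | The APPROX_MOST_FREQUENT(x, k, capacity) function approximately calculates the top k most frequent elements in a dataset. It returns a JSON-formatted string where the keys are the element values and the values are their corresponding approximate frequencies. (Supported in V2.0.5.1 and later) | `x`: the column to be calculated, supporting all existing IoTDB data types;<br> `k`: the number of most frequent values to return;<br> `capacity`: the number of buckets used for the calculation, which relates to memory usage: a larger value gives a smaller error but uses more memory, while a smaller value gives a larger error but uses less memory. | STRING |
| SUM | Calculates the sum. | INT32 INT64 FLOAT DOUBLE | DOUBLE |
| AVG | Calculates the average. | INT32 INT64 FLOAT DOUBLE | DOUBLE |
| MAX | Finds the maximum value. | All types | Same as the input type |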
@@ -250,8 +251,28 @@ Total line number = 1
It costs 0.022s
```
+#### 2.3.5 Approx_most_frequent
-#### 2.3.5 First
+Query the 2 most frequent values in the `temperature` column of `table1`.
+
+```sql
+IoTDB> select approx_most_frequent(temperature,2,100) as topk from table1;
+```
+
+The execution result is as follows:
+
+```sql
++-------------------+
+| topk|
++-------------------+
+|{"85.0":6,"90.0":5}|
++-------------------+
+Total line number = 1
+It costs 0.064s
+```
+
+
+#### 2.3.6 First
Finds the values with the smallest timestamp that are not NULL in the `temperature` and `humidity` columns.
@@ -271,7 +292,7 @@ Total line number = 1
It costs 0.170s
```
-#### 2.3.6 Last
+#### 2.3.7 Last
Finds the values with the largest timestamp that are not NULL in the `temperature` and `humidity` columns.
@@ -291,7 +312,7 @@ Total line number = 1
It costs 0.211s
```
-#### 2.3.7 First_by
+#### 2.3.8 First_by
Queries the `time` value of the row with the smallest timestamp where `temperature` is not NULL, and the `humidity` value of the row with the smallest timestamp where `temperature` is not NULL.
@@ -311,7 +332,7 @@ Total line number = 1
It costs 0.269s
```
-#### 2.3.8 Last_by
+#### 2.3.9 Last_by
Queries the `time` value of the row with the largest timestamp where `temperature` is not NULL, and the `humidity` value of the row with the largest timestamp where `temperature` is not NULL.
@@ -331,7 +352,7 @@ Total line number = 1
It costs 0.070s
```
-#### 2.3.9 Max_by
+#### 2.3.10 Max_by
Queries the `time` value of the row where the `temperature` column is at its maximum, and the `humidity` value of the row where the `temperature` column is at its maximum.
@@ -351,7 +372,7 @@ Total line number = 1
It costs 0.172s
```
-#### 2.3.10 Min_by
+#### 2.3.11 Min_by
Queries the `time` value of the row where the `temperature` column is at its minimum, and the `humidity` value of the row where the `temperature` column is at its minimum.
