[iotdb] branch master updated: [IOTDB-4718] Update user guide of SELECT INTO statement (#7771)

hui Fri, 28 Oct 2022 02:05:47 -0700

This is an automated email from the ASF dual-hosted git repository.

hui pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iotdb.git



The following commit(s) were added to refs/heads/master by this push:
     new 2536fda09b [IOTDB-4718] Update user guide of SELECT INTO statement 
(#7771)
2536fda09b is described below

commit 2536fda09b47b0936dded6caed3ae924ab7970fd
Author: liuminghui233 <[email protected]>
AuthorDate: Fri Oct 28 17:05:29 2022 +0800

    [IOTDB-4718] Update user guide of SELECT INTO statement (#7771)
---
 docs/UserGuide/Process-Data/Select-Into.md    | 425 +++++++++++++++-----------
 docs/zh/UserGuide/Process-Data/Select-Into.md | 425 ++++++++++++++++----------
 2 files changed, 507 insertions(+), 343 deletions(-)

diff --git a/docs/UserGuide/Process-Data/Select-Into.md 
b/docs/UserGuide/Process-Data/Select-Into.md
index 1754d10588..96e01bf783 100644
--- a/docs/UserGuide/Process-Data/Select-Into.md
+++ b/docs/UserGuide/Process-Data/Select-Into.md
@@ -19,242 +19,319 @@
 
 -->
 
-
-
 # Query Write-back (SELECT INTO)
 
-The `SELECT ... INTO ...` statement copies data from query result set into 
target time series.
-
+The `SELECT INTO` statement copies data from query result set into target time 
series.
 
+The application scenarios are as follows:
+- **Implement IoTDB internal ETL**: ETL the original data and write a new time 
series.
+- **Query result storage**: Persistently store the query results, which acts 
like a materialized view.
+- **Non-aligned time series to aligned time series**:  Rewrite non-aligned 
time series into aligned time series.
 
-## SQL
+## SQL Syntax
 
-### Syntax
+### Syntax Definition
 
 **The following is the syntax definition of the `select` statement:**
 
 ```sql
-selectClause 
-intoClause? 
-fromClause 
-whereClause? 
-specialClause?
-```
-
-If the `intoClause` is removed, then the `select` statement is a simple query 
statement.
-
-The `intoClause` is the mark clause for query write-back.
-
-
-
-**The following is the definition of the `intoClause`:**
-
-```sql
-intoClause
-  : INTO ALIGNED? intoPath (COMMA intoPath)*
-  ;
-
-intoPath
-  : fullPath
-  | nodeNameWithoutStar (DOT nodeNameWithoutStar)*
-  ;
+selectIntoStatement
+: SELECT
+      resultColumn [, resultColumn] ...
+        INTO intoItem [, intoItem] ...
+        FROM prefixPath [, prefixPath] ...
+        [WHERE whereCondition]
+      [GROUP BY groupByTimeClause, groupByLevelClause]
+      [FILL {PREVIOUS | LINEAR | constant}]
+      [LIMIT rowLimit OFFSET rowOffset]
+      [ALIGN BY DEVICE]
+;
+
+intoItem
+: [ALIGNED] intoDevicePath '(' intoMeasurementName [',' intoMeasurementName]* 
')'
+    ;
 ```
 
-There are 2 ways to specify target paths (`intoPath`).
-
-*  Full target series name starting with `root`.
-
-  * Example：
-
-    ```sql
-    select s1, s1 
-    into root.sg.d1.t1, root.sg.d1.t2 
-    from root.sg.d1
-    ```
-
-* Suffix path does not start with `root`. In this case, the target series name 
equals to the series prefix path in the `from` clause  +   the suffix path.
-
-  * Example:
-
-    ```sql
-    select s1, s1 
-    into t1, t2 
-    from root.sg.d1
-    ```
-
-    which equals to:
-
-    ```sql
-    select s1, s1 
-    into root.sg.d1.t1, root.sg.d1.t2 
-    from root.sg.d1
-    ```
-
-
+### `INTO` Clause
 
-**In `intoPath`, you can also use `${i}` to represent part of the prefix path 
in the `from` clause. **
+The `INTO` clause consists of several `intoItem`.
 
-For example, for the path `root.sg1.d1.v1`,  `${1}` means `sg1`,  `${2}` means 
`d1`, and `${3}` means `v1`.
+Each `intoItem` consists of a target device and a list of target measurements 
(similar to the `INTO` clause in an `INSERT` statement).
 
+Each target measurement and device form a target time series, and an 
`intoItem` contains a series of time series. For example: `root.sg_copy.d1(s1, 
s2)` specifies two target time series `root.sg_copy.d1.s1` and 
`root.sg_copy.d1.s2`.
 
-  * Example:
+The target time series specified by the `INTO` clause must correspond 
one-to-one with the columns of the query result set. The specific rules are as 
follows:
 
-    ```sql
-    select s1, s1, s1
-    into ${1}_t1, ${2}, root.${2}.${1}.t2
-    from root.sg.d1
-    ```
+- **Align by time** (default): The number of target time series contained in 
all `intoItem` must be consistent with the number of columns in the query 
result set (except the time column) and correspond one-to-one in the order from 
left to right in the header.
+- **Align by device** (using `ALIGN BY DEVICE`): the number of target devices 
specified in all `intoItem` is the same as the number of devices queried (i.e., 
the number of devices matched by the path pattern in the `FROM` clause), and 
One-to-one correspondence according to the output order of the result set 
device.
+  <br>The number of measurements specified for each target device should be 
consistent with the number of columns in the query result set (except for the 
time and device columns). It should be in one-to-one correspondence from left 
to right in the header.
 
-    which equals to:
+For examples:
 
-    ```sql
-    select s1, s1, s1
-    into root.sg.d1.sg_t1, root.sg.d1.d1, root.d1.sg.t2
-    from root.sg.d1
-    ```
-
-
-
-**You can specify whether the target timeseries are aligned via the keyword 
`ALIGNED`. **
-
-When the target aligned timeseries are existed, you need to ensure that the 
types of the source and target time series match.
-
-When the target aligned timeseries are not existed, the system will 
automatically create the target aligned time series.
+- **Example 1** (aligned by time)
+```shell
+IoTDB> select s1, s2 into root.sg_copy.d1(t1), root.sg_copy.d2(t1, t2), 
root.sg_copy.d1(t2) from root.sg.d1, root.sg.d2;
++--------------+-------------------+--------+
+| source column|  target timeseries| written|
++--------------+-------------------+--------+
+| root.sg.d1.s1| root.sg_copy.d1.t1|    8000|
++--------------+-------------------+--------+
+| root.sg.d2.s1| root.sg_copy.d2.t1|   10000|
++--------------+-------------------+--------+
+| root.sg.d1.s2| root.sg_copy.d2.t2|   12000|
++--------------+-------------------+--------+
+| root.sg.d2.s2| root.sg_copy.d1.t2|   10000|
++--------------+-------------------+--------+
+Total line number = 4
+It costs 0.725s
+```
+This statement writes the query results of the four time series under the 
`root.sg` storage group to the four specified time series under the 
`root.sg_copy` storage group. Note that `root.sg_copy.d2(t1, t2)` can also be 
written as `root.sg_copy.d2(t1), root.sg_copy.d2(t2)`.
+
+We can see that the writing of the `INTO` clause is very flexible as long as 
the combined target time series is not repeated and corresponds to the query 
result column one-to-one.
+
+> In the result set displayed by `CLI`, the meaning of each column is as 
follows:
+> - The `source column` column represents the column name of the query result.
+> - `target timeseries` represents the target time series for the 
corresponding column to write.
+> - `written` indicates the amount of data expected to be written.
+
+
+- **Example 2** (aligned by time)
+```shell
+IoTDB> select count(s1 + s2), last_value(s2) into root.agg.count(s1_add_s2), 
root.agg.last_value(s2) from root.sg.d1 group by ([0, 100), 10ms);
++--------------------------------------+-------------------------+--------+
+|                         source column|        target timeseries| written|
++--------------------------------------+-------------------------+--------+
+|  count(root.sg.d1.s1 + root.sg.d1.s2)| root.agg.count.s1_add_s2|      10|
++--------------------------------------+-------------------------+--------+
+|             last_value(root.sg.d1.s2)|   root.agg.last_value.s2|      10|
++--------------------------------------+-------------------------+--------+
+Total line number = 2
+It costs 0.375s
+```
 
+This statement stores the results of an aggregated query into the specified 
time series.
+
+- **Example 3** (aligned by device)
+```shell
+IoTDB> select s1, s2 into root.sg_copy.d1(t1, t2), root.sg_copy.d2(t1, t2) 
from root.sg.d1, root.sg.d2 align by device;
++--------------+--------------+-------------------+--------+
+| source device| source column|  target timeseries| written|
++--------------+--------------+-------------------+--------+
+|    root.sg.d1|            s1| root.sg_copy.d1.t1|    8000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d1|            s2| root.sg_copy.d1.t2|   11000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d2|            s1| root.sg_copy.d2.t1|   12000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d2|            s2| root.sg_copy.d2.t2|    9000|
++--------------+--------------+-------------------+--------+
+Total line number = 4
+It costs 0.625s
+```
+This statement also writes the query results of the four time series under the 
`root.sg` storage group to the four specified time series under the 
`root.sg_copy` storage group. However, in ALIGN BY DEVICE, the number of 
`intoItem` must be the same as the number of queried devices, and each queried 
device corresponds to one `intoItem`.
+
+It is easy to see that in the case of ALIGN BY DEVICE query, columns under the 
same device can only be written to the same target device.
+
+> When aligning the query by device, the result set displayed by `CLI` has one 
more column, the `source device` column indicating the queried device.
+
+- **Example 4** (aligned by device)
+```shell
+IoTDB> select s1 + s2 into root.expr.add(d1s1_d1s2), root.expr.add(d2s1_d2s2) 
from root.sg.d1, root.sg.d2 align by device;
++--------------+--------------+------------------------+--------+
+| source device| source column|       target timeseries| written|
++--------------+--------------+------------------------+--------+
+|    root.sg.d1|       s1 + s2| root.expr.add.d1s1_d1s2|   10000|
++--------------+--------------+------------------------+--------+
+|    root.sg.d2|       s1 + s2| root.expr.add.d2s1_d2s2|   10000|
++--------------+--------------+------------------------+--------+
+Total line number = 2
+It costs 0.532s
+```
+This statement stores the result of evaluating an expression into the 
specified time series.
 
-   * Example:
+### Using variable placeholders
 
-     ```sql
-     select s1, s2, s3
-     into aligned root.sg.d2.t1, root.sg.d2.t2, root.sg.d2.t3
-     from root.sg.d1
-     ````
+In particular, We can use variable placeholders to describe the correspondence 
between the target and query time series, simplifying the statement. The 
following two variable placeholders are currently supported:
 
+- Suffix duplication character `::`: Copy the suffix (or measurement) of the 
query device, indicating that from this layer to the last layer (or 
measurement) of the device, the node name (or measurement) of the target device 
corresponds to the queried device The node name (or measurement) is the same.
+- Single-level node matcher `${i}`: Indicates that the current level node name 
of the target sequence is the same as the i-th level node name of the query 
sequence. For example, for the path `root.sg1.d1.s1`, `${1}` means `sg1`, 
`${2}` means `d1`, and `${3}` means `s1`.
 
+Note: The variable placeholder **can only describe the correspondence between 
time series**, and cannot be used for functions, expressions, etc.
 
-### Supported Query Types
+When using variable placeholders, there must be no ambiguity in the 
correspondence between `intoItem` and the columns of the query result set. The 
specific cases are classified as follows:
 
-**Note that except for the following types of queries, other types of queries 
(such as `LAST` queries and raw aggregation queries) are not supported. **
+#### ALIGN BY TIME (default)
 
-* Raw time series query
+> Note: If the query includes aggregation and expression calculation, the 
columns in the query result cannot correspond to a time series, so neither the 
target device nor the measurement can use variable placeholders.
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  ```
+##### (1) The target device does not use variable placeholders & the target 
measurement list uses variable placeholders
 
-* Time series generating function query（including UDF query）
+**Limitations:**
+1. In each `intoItem`, the length of the list of physical quantities must be 
1. <br> (If the length can be greater than 1, e.g. `root.sg1.d1(::, s1)`, it is 
not possible to determine which columns match `::`)
+2. The number of `intoItem` is 1, or the same as the number of columns in the 
query result set. <br>(When the length of each target measurement list is 1, if 
there is only one `intoItem`, it means that all the query sequences are written 
to the same device; if the number of `intoItem` is consistent with the query 
sequence, it is expressed as each query time series specifies a target device; 
if `intoItem` is greater than one and less than the number of query sequences, 
it cannot be a one- [...]
 
-  ```sql
-  select s1, sin(s2) 
-  into t1, t2 
-  from root.sg.d1
-  ```
+**Matching method:** Each query time series specifies the target device, and 
the target measurement is generated from the variable placeholder.
 
-* Arithmetic query
+**Example:**
 
-  ```sql
-  select s1, sin(s2), s1 + s3 
-  into t1, t2, t3 
-  from root.sg.d1
-  ```
+```sql
+select s1, s2
+into root.sg_copy.d1(::), rot.sg_copy.d2(s1), root.sg_copy.d1(${3}), 
root.sg_copy.d2(::),
+from root.sg.d1, root.sg.d2;
+````
+This statement is equivalent to:
+```sql
+select s1, s2
+into root.sg_copy.d1(s1), rot.sg_copy.d2(s1), root.sg_copy.d1(s2), 
root.sg_copy.d2(s2),
+from root.sg.d1, root.sg.d2;
+````
+As you can see, the statement is not very simplified in this case.
 
-* Nested query
+##### (2) The target device uses variable placeholders & the target 
measurement list does not use variable placeholders
 
-  ```sql
-  select -s1, sin(cos(tan(s1 + s2 * s3))) + cos(s3), top_k(s1 + s3, 'k'='1') 
-  into t1, t2, t3 
-  from root.sg.d1
-  ```
-  
-* Fill query
+**Limitations:** The number of target measurements in all `intoItem` is the 
same as the number of columns in the query result set.
 
-  ```sql
-  select s1 
-  into fill_s1 
-  from root.sg.d1 
-  where time = 10 
-  fill(float [linear, 1ms, 1ms])
-  ```
+**Matching method:** The target measurement is specified for each query time 
series, and the target device is generated according to the target device 
placeholder of the `intoItem` where the corresponding target measurement is 
located.
 
-* Group-by query
+**Example:**
+```sql
+select d1.s1, d1.s2, d2.s3, d3.s4
+into ::(s1_1, s2_2), root.sg.d2_2(s3_3), root.${2}_copy.::(s4)
+from root.sg;
+````
+##### (3) The target device uses variable placeholders & the target 
measurement list uses variable placeholders
 
-  ```sql
-  select count(s1) 
-  into group_by_s1 
-  from root.sg.d1 
-  group by ([1, 5), 1ms)
-  ```
+**Limitations:** There is only one `intoItem`, and the length of the list of 
measurement list is 1.
 
-* Group-by-fill query
+**Matching method:** Each query time series can get a target time series 
according to the variable placeholder.
 
-  ```sql
-  select last_value(s1) 
-  into group_by_fill_s1 
-  from root.sg.d1 
-  group by ([1, 10),1ms) 
-  fill (float[PREVIOUS])
-  ```
+**Example:**
+```sql
+select * into root.sg_bk.::(::) from root.sg.**;
+````
+Write the query results of all time series under `root.sg` to `root.sg_bk`, 
the device name suffix and measurement remain unchanged.
 
+#### ALIGN BY DEVICE
 
+> Note: If the query includes aggregation and expression calculation, the 
columns in the query result cannot correspond to a specific physical quantity, 
so the target measurement cannot use variable placeholders.
 
-### Special Cluases Supported in Queries
+##### (1) The target device does not use variable placeholders & the target 
measurement list uses variable placeholders
 
-**Note that except for the following clauses, other query clauses (such as 
`DESC`, `SOFFSET`, etc.) are not supported. **
+**Limitations:** In each `intoItem`, if the list of measurement uses variable 
placeholders, the length of the list must be 1.
 
-* Value filter
+**Matching method:** Each query time series specifies the target device, and 
the target measurement is generated from the variable placeholder.
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  where s1 > 0 and s2 < 0
-  ```
+**Example:**
+```sql
+select s1, s2, s3, s4
+into root.backup_sg.d1(s1, s2, s3, s4), root.backup_sg.d2(::), 
root.sg.d3(backup_${4})
+from root.sg.d1, root.sg.d2, root.sg.d3
+align by device;
+````
 
-* Time filter
+##### (2) The target device uses variable placeholders & the target 
measurement list does not use variable placeholders
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  where time > 0
-  ```
+**Limitations:** There is only one `intoItem`. (If there are multiple 
`intoItem` with placeholders, we will not know which source devices each 
`intoItem` needs to match)
 
-* LIMIT / OFFSET
+**Matching method:** Each query device obtains a target device according to 
the variable placeholder, and the target measurement written in each column of 
the result set under each device is specified by the target measurement list.
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  limit 5 offset 1000
-  ```
+**Example:**
+```sql
+select avg(s1), sum(s2) + sum(s3), count(s4)
+into root.agg_${2}.::(avg_s1, sum_s2_add_s3, count_s4)
+from root.**
+align by device;
+````
 
+##### (3) The target device uses variable placeholders & the target 
measurement list uses variable placeholders
 
+**Limitations:** There is only one `intoItem` and the length of the target 
measurement list is 1.
 
-### Other Restrictions
+**Matching method:** Each query time series can get a target time series 
according to the variable placeholder.
 
-* The number of source series in the `select` clause and the number of target 
series in the `into` clause must be the same.
-* The `select *` and `select **` clause are not supported.
-* The target series in the `into` clause do not need to be created in advance. 
When the target series in the `into` clause already exist, you need to ensure 
that the source series in the `select` clause and the target series in the 
`into` clause have the same data types.
-* The target series in the `into` clause must be different from each other.
-* Only one prefix path of a series is allowed in the `from` clause.
-* `*` and `**` are not allowed in the `from` clause.
-* Aligned Timeseries has not been supported in Time series generating function 
query（including UDF query）/ Arithmetic query / Nested query yet. An error 
message is expected if you use these types of query with Aligned Timeseries 
selected in the `select` clause.
+**Example:**
+```sql
+select * into ::(backup_${4}) from root.sg.** align by device;
+````
+Write the query result of each time series in `root.sg` to the same device, 
and add `backup_` before the measurement.
+
+### Specify the target time series as the aligned time series
+
+We can use the `ALIGNED` keyword to specify the target device for writing to 
be aligned, and each `intoItem` can be set independently.
+
+### Unsupported query clauses
+
+- `SLIMIT`, `SOFFSET`: The query columns are uncertain, so they are not 
supported.
+- `LAST`, `GROUP BY TAGS`, `DISABLE ALIGN`: The table structure is 
inconsistent with the writing structure, so it is not supported.
+
+### Other points to note
+
+- For general aggregation queries, the timestamp is meaningless, and the 
convention is to use 0 to store.
+- When the target time series exists, the metadata information such as the 
data type, compression,  encoding, and whether it belongs to the aligned device 
of the source time series and the target time series must be consistent.
+- When the target time series does not exist, the system automatically creates 
it (including the storage group).
+- When the queried time series does not exist, or the queried sequence does 
not have data, the target time series will not be created automatically.
+
+## Application examples
+
+### Implement IoTDB internal ETL
+ETL the original data and write a new time series.
+```shell
+IOTDB > SELECT preprocess_udf(*) INTO ::(preprocessed_${3}) FROM root.sg.*;
++-------------------------------+---------------------------+--------+
+|                  source column|          target timeseries| written|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d1.s1)| root.sg.d1.preprocessed_s1|    8000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d1.s2)| root.sg.d1.preprocessed_s1|   10000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d2.s1)| root.sg.d2.preprocessed_s1|   11000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d2.s2)| root.sg.d2.preprocessed_s1|    9000|
++-------------------------------+---------------------------+--------+
+```
 
+### Query result storage
+Persistently store the query results, which acts like a materialized view.
+```shell
+IOTDB > SELECT count(s1), last_value(s1) INTO root.sg.agg_${2}(count_s1, 
last_value_s1) FROM root.sg1.d1 GROUP BY ([0, 10000), 10ms);
++--------------------------+-----------------------------+--------+
+|             source column|            target timeseries| written|
++--------------------------+-----------------------------+--------+
+|      count(root.sg.d1.s1)|      root.sg.agg_d1.count_s1|    1000|
++--------------------------+-----------------------------+--------+
+| last_value(root.sg.d1.s2)| root.sg.agg_d1.last_value_s2|    1000|
++--------------------------+-----------------------------+--------+
+Total line number = 2
+It costs 0.115s
+```
 
+### Non-aligned time series to aligned time series
+Rewrite non-aligned time series into aligned time series.
+
+**Note:** It is recommended to use the `LIMIT & OFFSET` clause or the `WHERE` 
clause (time filter) to batch data to prevent excessive data volume in a single 
operation.
+
+```shell
+IOTDB > SELECT s1, s2 INTO ALIGNED root.sg1.aligned_d(s1, s2) FROM 
root.sg1.non_aligned_d WHERE time >= 0 and time < 10000;
++--------------------------+----------------------+--------+
+|             source column|     target timeseries| written|
++--------------------------+----------------------+--------+
+| root.sg1.non_aligned_d.s1| root.sg1.aligned_d.s1|   10000|
++--------------------------+----------------------+--------+
+| root.sg1.non_aligned_d.s2| root.sg1.aligned_d.s2|   10000|
++--------------------------+----------------------+--------+
+Total line number = 2
+It costs 0.375s
+```
 
 ## User Permission Management
 
 The user must have the following permissions to execute a query write-back 
statement:
 
-* All `READ_TIMESERIES` permissions for the source series in the `select` 
clause
-* All `INSERT_TIMESERIES` permissions for the target series in the `into` 
clause
+* All `READ_TIMESERIES` permissions for the source series in the `select` 
clause.
+* All `INSERT_TIMESERIES` permissions for the target series in the `into` 
clause.
 
 For more user permissions related content, please refer to [Account Management 
Statements](../Administration-Management/Administration.md).
 
-
-
 ## Configurable Properties
 
 * `select_into_insert_tablet_plan_row_limit`: The maximum number of rows can 
be processed in one insert-tablet-plan when executing select-into statements. 
10000 by default.
diff --git a/docs/zh/UserGuide/Process-Data/Select-Into.md 
b/docs/zh/UserGuide/Process-Data/Select-Into.md
index 6f9afc64e6..3d0a57299a 100644
--- a/docs/zh/UserGuide/Process-Data/Select-Into.md
+++ b/docs/zh/UserGuide/Process-Data/Select-Into.md
@@ -19,243 +19,330 @@
 
 -->
 
+# 查询写回（SELECT INTO）
 
+`SELECT INTO` 语句用于将查询结果写入一系列指定的时间序列中。
 
-# 查询写回（SELECT ... INTO ...）
+应用场景如下：
+- **实现 IoTDB 内部 ETL**：对原始数据进行 ETL 处理后写入新序列。
+- **查询结果存储**：将查询结果进行持久化存储，起到类似物化视图的作用。
+- **非对齐序列转对齐序列**：对齐序列从0.13版本开始支持，可以通过该功能将历史的非对齐序列重新写成对齐序列。
 
-`SELECT ... INTO ...` 语句允许您将查询结果集写回到指定序列上。
+## 语法定义
 
-
-
-## SQL
-
-### 语法
-
-**下面是 `select` 语句的语法定义：**
-
-```sql
-selectClause 
-intoClause? 
-fromClause 
-whereClause? 
-specialClause?
-```
-
-如果去除 `intoClause` 子句，那么 `select` 语句即是单纯的查询语句。
-
-`intoClause` 子句是写回功能的标记语句。
-
-
-
-**下面是 `intoClause` 子句的定义：**
+### 整体描述
 
 ```sql
-intoClause
-  : INTO ALIGNED? intoPath (COMMA intoPath)*
-  ;
-
-intoPath
-  : fullPath
-  | nodeNameWithoutStar (DOT nodeNameWithoutStar)*
-  ;
+selectIntoStatement
+    : SELECT
+        resultColumn [, resultColumn] ...
+        INTO intoItem [, intoItem] ...
+        FROM prefixPath [, prefixPath] ...
+        [WHERE whereCondition]
+        [GROUP BY groupByTimeClause, groupByLevelClause]
+        [FILL {PREVIOUS | LINEAR | constant}]
+        [LIMIT rowLimit OFFSET rowOffset]
+        [ALIGN BY DEVICE]
+    ;
+
+intoItem
+    : [ALIGNED] intoDevicePath '(' intoMeasurementName [',' 
intoMeasurementName]* ')'
+    ;
 ```
 
-`intoPath`（目标序列）支持两种方式指定：
-
-* 以 `root` 开头的完整序列名指定
+### `INTO` 子句
 
-  * 例子：
+`INTO` 子句由若干个 `intoItem` 构成。
 
-    ```sql
-    select s1, s1 
-    into root.sg.d1.t1, root.sg.d1.t2 
-    from root.sg.d1
-    ```
+每个 `intoItem` 由一个目标设备路径和一个包含若干目标物理量名的列表组成（与 `INSERT` 语句中的 `INTO` 子句写法类似）。
 
-* 不以 `root` 开头的部分序列名指定，此时目标序列由 `from` 子句中的序列前缀和`intoPath`拼接而成
+其中每个目标物理量名与目标设备路径组成一个目标序列，一个 `intoItem` 包含若干目标序列。例如：`root.sg_copy.d1(s1, s2)` 
指定了两条目标序列 `root.sg_copy.d1.s1` 和 `root.sg_copy.d1.s2`。
 
-  * 例子：
-
-    ```sql
-    select s1, s1 
-    into t1, t2 
-    from root.sg.d1
-    ```
-    
-    这等价于
-    
-    ```sql
-    select s1, s1 
-    into root.sg.d1.t1, root.sg.d1.t2 
-    from root.sg.d1
-    ```
+`INTO` 子句指定的目标序列要能够与查询结果集的列一一对应。具体规则如下：
 
+- **按时间对齐**（默认）：全部 `intoItem` 包含的目标序列数量要与查询结果集的列数（除时间列外）一致，且按照表头从左到右的顺序一一对应。
+- **按设备对齐**（使用 `ALIGN BY DEVICE`）：全部 `intoItem` 中指定的目标设备数和查询的设备数（即 `FROM` 
子句中路径模式匹配的设备数）一致，且按照结果集设备的输出顺序一一对应。
+  为每个目标设备指定的目标物理量数量要与查询结果集的列数（除时间和设备列外）一致，且按照表头从左到右的顺序一一对应。
 
+下面通过示例进一步说明：
 
-**在`intoPath` 中，您还可以使用 `${i}`风格的路径匹配符来表示`from`子句中的部分路径。**
+- **示例 1**（按时间对齐）
+```shell
+IoTDB> select s1, s2 into root.sg_copy.d1(t1), root.sg_copy.d2(t1, t2), 
root.sg_copy.d1(t2) from root.sg.d1, root.sg.d2;
++--------------+-------------------+--------+
+| source column|  target timeseries| written|
++--------------+-------------------+--------+
+| root.sg.d1.s1| root.sg_copy.d1.t1|    8000|
++--------------+-------------------+--------+
+| root.sg.d2.s1| root.sg_copy.d2.t1|   10000|
++--------------+-------------------+--------+
+| root.sg.d1.s2| root.sg_copy.d2.t2|   12000|
++--------------+-------------------+--------+
+| root.sg.d2.s2| root.sg_copy.d1.t2|   10000|
++--------------+-------------------+--------+
+Total line number = 4
+It costs 0.725s
+```
 
-比如，对于路径`root.sg1.d1.v1`而言，`${1}`表示`sg1`，`${2}`表示`d1`，`${3}`表示`v1`。
+该语句将 `root.sg` 存储组下四条序列的查询结果写入到 `root.sg_copy` 
存储组下指定的四条序列中。注意，`root.sg_copy.d2(t1, t2)` 也可以写做 `root.sg_copy.d2(t1), 
root.sg_copy.d2(t2)`。
+
+可以看到，`INTO` 子句的写法非常灵活，只要满足组合出的目标序列没有重复，且与查询结果列一一对应即可。
+
+> `CLI` 展示的结果集中，各列的含义如下：
+> - `source column` 列表示查询结果的列名。
+> - `target timeseries` 表示对应列写入的目标序列。
+> - `written` 表示预期写入的数据量。
+
+- **示例 2**（按时间对齐）
+```shell
+IoTDB> select count(s1 + s2), last_value(s2) into root.agg.count(s1_add_s2), 
root.agg.last_value(s2) from root.sg.d1 group by ([0, 100), 10ms);
++--------------------------------------+-------------------------+--------+
+|                         source column|        target timeseries| written|
++--------------------------------------+-------------------------+--------+
+|  count(root.sg.d1.s1 + root.sg.d1.s2)| root.agg.count.s1_add_s2|      10|
++--------------------------------------+-------------------------+--------+
+|             last_value(root.sg.d1.s2)|   root.agg.last_value.s2|      10|
++--------------------------------------+-------------------------+--------+
+Total line number = 2
+It costs 0.375s
+```
 
+该语句将聚合查询的结果存储到指定序列中。
+
+- **示例 3**（按设备对齐）
+```shell
+IoTDB> select s1, s2 into root.sg_copy.d1(t1, t2), root.sg_copy.d2(t1, t2) 
from root.sg.d1, root.sg.d2 align by device;
++--------------+--------------+-------------------+--------+
+| source device| source column|  target timeseries| written|
++--------------+--------------+-------------------+--------+
+|    root.sg.d1|            s1| root.sg_copy.d1.t1|    8000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d1|            s2| root.sg_copy.d1.t2|   11000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d2|            s1| root.sg_copy.d2.t1|   12000|
++--------------+--------------+-------------------+--------+
+|    root.sg.d2|            s2| root.sg_copy.d2.t2|    9000|
++--------------+--------------+-------------------+--------+
+Total line number = 4
+It costs 0.625s
+```
 
-  * 例子：
+该语句同样是将 `root.sg` 存储组下四条序列的查询结果写入到 `root.sg_copy` 
存储组下指定的四条序列中。但在按设备对齐中，`intoItem` 的数量必须和查询的设备数量一致，每个查询设备对应一个 `intoItem`。
 
-    ```sql
-    select s1, s1, s1
-    into ${1}_t1, ${2}, root.${2}.${1}.t2
-    from root.sg.d1
-    ```
-    
-    这等价于
-    
-    ```sql
-    select s1, s1, s1
-    into root.sg.d1.sg_t1, root.sg.d1.d1, root.d1.sg.t2
-    from root.sg.d1
-    ```
+容易看出，在按设备对齐查询的情况下，同一个设备下的列只能写入相同的目标设备。
 
+> 按设备对齐查询时，`CLI` 展示的结果集多出一列 `source device` 列表示查询的设备。
 
+- **示例 4**（按设备对齐）
+```shell
+IoTDB> select s1 + s2 into root.expr.add(d1s1_d1s2), root.expr.add(d2s1_d2s2) 
from root.sg.d1, root.sg.d2 align by device;
++--------------+--------------+------------------------+--------+
+| source device| source column|       target timeseries| written|
++--------------+--------------+------------------------+--------+
+|    root.sg.d1|       s1 + s2| root.expr.add.d1s1_d1s2|   10000|
++--------------+--------------+------------------------+--------+
+|    root.sg.d2|       s1 + s2| root.expr.add.d2s1_d2s2|   10000|
++--------------+--------------+------------------------+--------+
+Total line number = 2
+It costs 0.532s
+```
 
-**您可以通过关键词  `ALIGNED` 指定 `intoPath`（目标序列）是否为一个对齐时间序列。**
+该语句将表达式计算的结果存储到指定序列中。
 
-当目标序列存在时，您需要保证源序列和目标时间序列的类型匹配。
+### 使用变量占位符
 
-当目标序列不存在时，系统将自动创建一个新的目标对齐时间序列。
+特别地，可以使用变量占位符描述目标序列与查询序列之间的对应规律，简化语句书写。目前支持以下两种变量占位符：
+ 
+- 后缀复制符 
`::`：复制查询设备后缀（或物理量），表示从该层开始一直到设备的最后一层（或物理量），目标设备的节点名（或物理量名）与查询的设备对应的节点名（或物理量名）相同。
+- 单层节点匹配符 
`${i}`：表示目标序列当前层节点名与查询序列的第`i`层节点名相同。比如，对于路径`root.sg1.d1.s1`而言，`${1}`表示`sg1`，`${2}`表示`d1`，`${3}`表示`s1`。
 
+注意：变量占位符**只能描述序列与序列之间的对应关系**，不能用于函数、表达式等。
 
-  * 例子：
+在使用变量占位符时，`intoItem`与查询结果集列的对应关系不能存在歧义，具体情况分类讨论如下：
 
-    ```sql
-    select s1, s2, s3
-    into aligned root.sg.d2.t1, root.sg.d2.t2, root.sg.d2.t3
-    from root.sg.d1
-    ```
+#### 按时间对齐（默认）
 
+> 注：如果查询中包含聚合、表达式计算，此时查询结果中的列无法与某个序列对应，因此目标设备和目标物理量都不能使用变量占位符。 
 
+##### （1）目标设备不使用变量占位符 & 目标物理量列表使用变量占位符
+  
+**限制：**
+   1. 每个 `intoItem` 中，物理量列表的长度必须为 1。<br>（如果长度可以大于1，例如 `root.sg1.d1(::, 
s1)`，无法确定具体哪些列与`::`匹配）
+   2. `intoItem` 数量为 1，或与查询结果集列数一致。<br>（在每个目标物理量列表长度均为 1 的情况下，若 `intoItem` 只有 
1 个，此时表示全部查询序列写入相同设备；若 `intoItem` 数量与查询序列一致，则表示为每个查询序列指定一个目标设备；若 `intoItem` 大于 
1 小于查询序列数，此时无法与查询序列一一对应）
 
+**匹配方法：** 每个查询序列指定目标设备，而目标物理量根据变量占位符生成。
 
-### 支持写回的查询类型
+**示例：**
 
-**注意，除了下述类型的查询，其余类型的查询（如`LAST`查询和原始聚合查询）都不被支持。**
+```sql
+select s1, s2
+into root.sg_copy.d1(::), rot.sg_copy.d2(s1), root.sg_copy.d1(${3}), 
root.sg_copy.d2(::),
+from root.sg.d1, root.sg.d2;
+```
+该语句等价于：
+```sql
+select s1, s2
+into root.sg_copy.d1(s1), rot.sg_copy.d2(s1), root.sg_copy.d1(s2), 
root.sg_copy.d2(s2),
+from root.sg.d1, root.sg.d2;
+```
+可以看到，在这种情况下，语句并不能得到很好地简化。
 
-* 原始序列查询
+##### （2）目标设备使用变量占位符 & 目标物理量列表不使用变量占位符
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  ```
+**限制：** 全部 `intoItem` 中目标物理量的数量与查询结果集列数一致。
 
-* 时间序列生成函数查询（UDF查询）
+**匹配方式：** 为每个查询序列指定了目标物理量，目标设备根据对应目标物理量所在 `intoItem` 的目标设备占位符生成。
 
-  ```sql
-  select s1, sin(s2) 
-  into t1, t2 
-  from root.sg.d1
-  ```
+**示例：**
+```sql
+select d1.s1, d1.s2, d2.s3, d3.s4 
+into ::(s1_1, s2_2), root.sg.d2_2(s3_3), root.${2}_copy.::(s4)
+from root.sg;
+```
 
-* 数学表达式查询
+##### （3）目标设备使用变量占位符 & 目标物理量列表使用变量占位符
 
-  ```sql
-  select s1, sin(s2), s1 + s3 
-  into t1, t2, t3 
-  from root.sg.d1
-  ```
+**限制：** `intoItem` 只有一个且物理量列表的长度为 1。
 
-* 嵌套查询
+**匹配方式：** 每个查询序列根据变量占位符可以得到一个目标序列。
 
-  ```sql
-  select -s1, sin(cos(tan(s1 + s2 * s3))) + cos(s3), top_k(s1 + s3, 'k'='1') 
-  into t1, t2, t3 
-  from root.sg.d1
-  ```
+**示例：**
+```sql
+select * into root.sg_bk.::(::) from root.sg.**;
+```
+将 `root.sg` 下全部序列的查询结果写到 `root.sg_bk`，设备名后缀和物理量名保持不变。
 
-* Fill 查询
+#### 按设备对齐（使用 `ALIGN BY DEVICE`）
 
-  ```sql
-  select s1 
-  into fill_s1 
-  from root.sg.d1 
-  where time = 10 
-  fill(float [linear, 1ms, 1ms])
-  ```
+> 注：如果查询中包含聚合、表达式计算，此时查询结果中的列无法与某个物理量对应，因此目标物理量不能使用变量占位符。
 
-* Group By 查询
+##### （1）目标设备不使用变量占位符 & 目标物理量列表使用变量占位符
 
-  ```sql
-  select count(s1) 
-  into group_by_s1 
-  from root.sg.d1 
-  group by ([1, 5), 1ms)
-  ```
+**限制：** 每个 `intoItem` 中，如果物理量列表使用了变量占位符，则列表的长度必须为 1。
 
-* Group By Fill 查询
+**匹配方法：** 每个查询序列指定目标设备，而目标物理量根据变量占位符生成。
 
-       ```sql
-  select last_value(s1) 
-  into group_by_fill_s1 
-  from root.sg.d1 
-  group by ([1, 10),1ms) 
-  fill (float[PREVIOUS])
-  ```
+**示例：**
+```sql
+select s1, s2, s3, s4
+into root.backup_sg.d1(s1, s2, s3, s4), root.backup_sg.d2(::), 
root.sg.d3(backup_${4})
+from root.sg.d1, root.sg.d2, root.sg.d3
+align by device;
+```
 
+##### （2）目标设备使用变量占位符 & 目标物理量列表不使用变量占位符
 
+**限制：** `intoItem` 只有一个。（如果出现多个带占位符的 `intoItem`，我们将无法得知每个 `intoItem` 
需要匹配哪几个源设备）
 
-### 支持写回的查询子句
+**匹配方式：** 每个查询设备根据变量占位符得到一个目标设备，每个设备下结果集各列写入的目标物理量由目标物理量列表指定。
 
-**注意，除了下述子句，其余查询子句（如 `DESC` / `SOFFSET` 等）都不被支持。**
+**示例：**
+```sql
+select avg(s1), sum(s2) + sum(s3), count(s4)
+into root.agg_${2}.::(avg_s1, sum_s2_add_s3, count_s4)
+from root.**
+align by device;
+```
 
-* 支持值过滤
+##### （3）目标设备使用变量占位符 & 目标物理量列表使用变量占位符
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  where s1 > 0 and s2 < 0
-  ```
+**限制：** `intoItem` 只有一个且物理量列表的长度为 1。
 
-* 支持时间过滤
+**匹配方式：** 每个查询序列根据变量占位符可以得到一个目标序列。
 
-    ```sql
-    select s1, s1 
-    into t1, t2 
-    from root.sg.d1
-    where time > 0
-    ```
+**示例：**
+```sql
+select * into ::(backup_${4}) from root.sg.** align by device;
+```
+将 `root.sg` 下每条序列的查询结果写到相同设备下，物理量名前加`backup_`。
 
-* LIMIT / OFFSET
+### 指定目标序列为对齐序列
 
-  ```sql
-  select s1, s1 
-  into t1, t2 
-  from root.sg.d1
-  limit 5 offset 1000
-  ```
+通过 `ALIGNED` 关键词可以指定写入的目标设备为对齐写入，每个 `intoItem` 可以独立设置。
 
+### 不支持使用的查询子句
 
+- `SLIMIT`、`SOFFSET`：查询出来的列不确定，功能不清晰，因此不支持。
+- `LAST`、`GROUP BY TAGS`、`DISABLE ALIGN`：表结构和写入结构不一致，因此不支持。
 
-### 其他限制
+### 其他要注意的点
 
-* `select`子句中的源序列和`into`子句中的目标序列数量必须相同
-* `select`子句不支持带 `*`/`**` 查询
-* 
`into`子句中的目标序列不必预先创建（可使用自动创建schema功能），但是当`into`子句中的目标序列已存在时，您需要保证`select`子句中的源序列和`into`子句中的目标序列的数据类型一致
-* `into`子句中的目标序列必须是互不相同的
-* `from`子句只允许有一列序列前缀
-* `from`子句不支持带 `*`/`**`
-* 由于时间序列生成函数查询（UDF查询）/ 数学表达式查询 / 嵌套查询 尚不支持对齐时间序列（Aligned 
Timeseries），所以如果您在`select`子句中使用了上述查询，并且对应操作数包含对齐时间序列，会提示错误
+- 对于一般的聚合查询，时间戳是无意义的，约定使用 0 来存储。
+- 当目标序列存在时，需要保证源序列和目标时间序列的数据类型、压缩和编码方式、是否属于对齐设备等元数据信息一致。
+- 当目标序列不存在时，系统将自动创建目标序列（包括存储组）。
+- 当查询的序列不存在或查询的序列不存在数据，则不会自动创建目标序列。
 
+## 应用举例
 
+### 实现 IoTDB 内部 ETL
+对原始数据进行 ETL 处理后写入新序列。
+```shell
+IOTDB > SELECT preprocess_udf(*) INTO ::(preprocessed_${3}) FROM root.sg.*;
++-------------------------------+---------------------------+--------+
+|                  source column|          target timeseries| written|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d1.s1)| root.sg.d1.preprocessed_s1|    8000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d1.s2)| root.sg.d1.preprocessed_s1|   10000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d2.s1)| root.sg.d2.preprocessed_s1|   11000|
++-------------------------------+---------------------------+--------+
+|  preprocess_udf(root.sg.d2.s2)| root.sg.d2.preprocessed_s1|    9000|
++-------------------------------+---------------------------+--------+
+```
+以上语句使用自定义函数对数据进行预处理，将预处理后的结果持久化存储到新序列中。
+
+### 查询结果存储
+将查询结果进行持久化存储，起到类似物化视图的作用。
+```shell
+IOTDB > SELECT count(s1), last_value(s1) INTO root.sg.agg_${2}(count_s1, 
last_value_s1) FROM root.sg1.d1 GROUP BY ([0, 10000), 10ms);
++--------------------------+-----------------------------+--------+
+|             source column|            target timeseries| written|
++--------------------------+-----------------------------+--------+
+|      count(root.sg.d1.s1)|      root.sg.agg_d1.count_s1|    1000|
++--------------------------+-----------------------------+--------+
+| last_value(root.sg.d1.s2)| root.sg.agg_d1.last_value_s2|    1000|
++--------------------------+-----------------------------+--------+
+Total line number = 2
+It costs 0.115s
+```
+以上语句将降采样查询的结果持久化存储到新序列中。
+
+### 非对齐序列转对齐序列
+对齐序列从 0.13 版本开始支持，可以通过该功能将历史的非对齐序列重新写成对齐序列。
+
+**注意：** 建议配合使用 `LIMIT & OFFSET` 子句或 `WHERE` 子句（时间过滤条件）对数据进行分批，防止单次操作的数据量过大。
+
+```shell
+IOTDB > SELECT s1, s2 INTO ALIGNED root.sg1.aligned_d(s1, s2) FROM 
root.sg1.non_aligned_d WHERE time >= 0 and time < 10000;
++--------------------------+----------------------+--------+
+|             source column|     target timeseries| written|
++--------------------------+----------------------+--------+
+| root.sg1.non_aligned_d.s1| root.sg1.aligned_d.s1|   10000|
++--------------------------+----------------------+--------+
+| root.sg1.non_aligned_d.s2| root.sg1.aligned_d.s2|   10000|
++--------------------------+----------------------+--------+
+Total line number = 2
+It costs 0.375s
+```
+以上语句将一组非对齐的序列的数据迁移到一组对齐序列。
 
-## 权限
+## 相关用户权限
 
 用户必须有下列权限才能正常执行查询写回语句：
 
-* 所有 `select` 子句中源序列的 `READ_TIMESERIES` 权限
-* 所有 `into` 子句中目标序列 `INSERT_TIMESERIES` 权限
+* 所有 `SELECT` 子句中源序列的 `READ_TIMESERIES` 权限。
+* 所有 `INTO` 子句中目标序列 `INSERT_TIMESERIES` 权限。
 
 更多用户权限相关的内容，请参考[权限管理语句](../Administration-Management/Administration.md)。
 
+## 相关配置参数
 
+* `select_into_insert_tablet_plan_row_limit`
 
-## 配置参数
-
-* `select_into_insert_tablet_plan_row_limit`：执行 select-into 语句时，一个 
insert-tablet-plan 中可以处理的最大行数。 默认为 10000。
+  | 参数名 | select_into_insert_tablet_plan_row_limit |
+  | ---- | ---- | 
+  | 描述 | 写入过程中每一批 `Tablet` 的最大行数 | 
+  | 类型 | int32 | 
+  | 默认值 | 10000 | 
+  | 改后生效方式 | 重启后生效 |

[iotdb] branch master updated: [IOTDB-4718] Update user guide of SELECT INTO statement (#7771)

Reply via email to