This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 8e57f8404d86 [docs](doc) update en doc (#503)
8e57f8404d86 is described below
commit 8e57f8404d86e9bf71e077d633e767488feee7ae
Author: Hu Yanjun <[email protected]>
AuthorDate: Mon Apr 1 21:56:40 2024 +0800
[docs](doc) update en doc (#503)
---
.../data-operate/import/insert-into-manual.md | 487 ++++++++++++++-------
.../data-operate/import/mysql-load-manual.md | 287 ++++++++----
.../version-2.0/table-design/auto-increment.md | 421 ++++--------------
3 files changed, 624 insertions(+), 571 deletions(-)
diff --git
a/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md
b/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md
index f05200ad5b06..453ed86214f9 100644
--- a/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md
+++ b/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md
@@ -24,245 +24,414 @@ specific language governing permissions and limitations
under the License.
-->
-# Insert Into
+The INSERT INTO statement supports importing the results of a Doris query into
another table. INSERT INTO is a synchronous import method, where the import
result is returned after the import is executed. Whether the import is
successful can be determined based on the returned result. INSERT INTO ensures
the atomicity of the import task, meaning that either all the data is imported
successfully or none of it is imported.
-The use of Insert Into statements is similar to that of Insert Into statements
in databases such as MySQL. But in Doris, all data writing is a separate import
job. So Insert Into is also introduced here as an import method.
+There are two main forms of the INSERT INTO command:
-The main Insert Into command contains the following two kinds;
+- INSERT INTO tbl SELECT...
+- INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...)
-- INSERT INTO tbl SELECT ...
-- INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...);
+## Applicable scenarios
-The second command is for Demo only, not in a test or production environment.
+1. If a user wants to import only a few test data records to verify the
functionality of the Doris system, the INSERT INTO VALUES syntax is applicable.
It is similar to the MySQL syntax. However, it is not recommended to use INSERT
INTO VALUES in a production environment.
+2. If a user wants to perform ETL on existing data in a Doris table and then
import it into a new Doris table, the INSERT INTO SELECT syntax is applicable.
+3. In conjunction with the Multi-Catalog external table mechanism, tables from
MySQL or Hive systems can be mapped via Multi-Catalog. Then, data from external
tables can be imported into Doris tables using the INSERT INTO SELECT syntax.
+4. Utilizing the Table Value Functions (TVFs), users can directly query data
stored in object storage or files on HDFS as tables, with automatic column type
inference. Then, data from external tables can be imported into Doris tables
using the INSERT INTO SELECT syntax.
-## Import operations and load results
+## Implementation
-The Insert Into command needs to be submitted through MySQL protocol. Creating
an import request returns the import result synchronously.
+When using INSERT INTO, the import job needs to be initiated and submitted to
the FE node using the MySQL protocol. The FE generates an execution plan, which
includes query-related operators, with the last operator being the
OlapTableSink. The OlapTableSink operator is responsible for writing the query
result to the target table. The execution plan is then sent to the BE nodes for
execution. Doris designates one BE node as the Coordinator, which receives and
distributes the data to other [...]
-The following are examples of the use of two Insert Intos:
+## Get started
-```sql
-INSERT INTO tbl2 WITH LABEL label1 SELECT * FROM tbl3;
-INSERT INTO tbl1 VALUES ("qweasdzxcqweasdzxc"), ("a");
+An INSERT INTO job is submitted and transmitted using the MySQL protocol. The
following example demonstrates submitting an import job using INSERT INTO
through the MySQL command-line interface.
+
+Detailed syntax can be found in the INSERT INTO documentation.
+
+### Preparation
+
+INSERT INTO requires INSERT permissions on the target table. You can grant
permissions to user accounts using the GRANT command.
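+
+For example, permissions might be granted with a statement like the one below. This is a sketch only: the user name `test_user` is hypothetical, and it assumes that the INSERT permission mentioned above corresponds to the `LOAD_PRIV` privilege in Doris.
+
+```SQL
+-- Hypothetical user; LOAD_PRIV covers write/import operations such as INSERT.
+GRANT LOAD_PRIV ON testdb.test_table TO 'test_user'@'%';
+```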
+
+### Create an INSERT INTO job
+
+**INSERT INTO VALUES**
+
+1. Create a source table
+
+```SQL
+CREATE TABLE testdb.test_table(
+ user_id BIGINT NOT NULL COMMENT "User ID",
+ name VARCHAR(20) COMMENT "User name",
+ age INT COMMENT "User age"
+)
+DUPLICATE KEY(user_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 10;
+```
+
+2. Import data into the source table using `INSERT INTO VALUES` (not
recommended for production environments).
+
+```SQL
+INSERT INTO testdb.test_table (user_id, name, age)
+VALUES (1, "Emily", 25),
+ (2, "Benjamin", 35),
+ (3, "Olivia", 28),
+ (4, "Alexander", 60),
+ (5, "Ava", 17);
+```
+
+INSERT INTO is a synchronous import method, where the import result is
directly returned to the user.
+
+```JSON
+Query OK, 5 rows affected (0.308 sec)
+{'label':'label_3e52da787aab4222_9126d2fce8f6d1e5', 'status':'VISIBLE',
'txnId':'9081'}
+```
+
+3. View imported data.
+
+```SQL
+MySQL> SELECT COUNT(*) FROM testdb.test_table;
++----------+
+| count(*) |
++----------+
+| 5 |
++----------+
+1 row in set (0.179 sec)
```
-> Note: When you need to use `CTE(Common Table Expressions)` as the query part
in an insert operation, you must specify the `WITH LABEL` and column list parts
or wrap `CTE`. Example:
->
-> ```sql
-> INSERT INTO tbl1 WITH LABEL label1
-> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
-> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;
->
->
-> INSERT INTO tbl1 (k1)
-> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
-> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;
->
-> INSERT INTO tbl1 (k1)
-> select * from (
-> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
-> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1) as ret
-> ```
+**INSERT INTO SELECT**
-For specific parameter description, you can refer to [INSERT
INTO](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT.md)
command or execute `HELP INSERT ` to see its help documentation for better use
of this import method.
+1. Building upon the above operations, create a new table as the target table
(with the same schema as the source table).
+```SQL
+CREATE TABLE testdb.test_table2 LIKE testdb.test_table;
+```
-Insert Into itself is a SQL command, and the return result is divided into the
following types according to the different execution results:
+2. Ingest data into the new table using `INSERT INTO SELECT`.
-1. Result set is empty
+```SQL
+INSERT INTO testdb.test_table2
+SELECT * FROM testdb.test_table WHERE age < 30;
+Query OK, 3 rows affected (0.544 sec)
+{'label':'label_9c2bae970023407d_b2c5b78b368e78a7', 'status':'VISIBLE',
'txnId':'9084'}
+```
+
+3. View imported data.
+
+```SQL
+MySQL> SELECT COUNT(*) FROM testdb.test_table2;
++----------+
+| count(*) |
++----------+
+| 3 |
++----------+
+1 row in set (0.071 sec)
+```
- If the result set of the insert corresponding SELECT statement is empty, it
is returned as follows:
+### View INSERT INTO jobs
- ```text
- mysql> insert into tbl1 select * from empty_tbl;
- Query OK, 0 rows affected (0.02 sec)
- ```
+You can use the `SHOW LOAD` command to view the completed INSERT INTO tasks.
- `Query OK` indicates successful execution. `0 rows affected` means that no
data was loaded.
+```SQL
+MySQL> SHOW LOAD FROM testdb;
++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+
+| JobId | Label | State | Progress
| Type | EtlInfo | TaskInfo
| ErrorMsg | CreateTime | EtlStartTime |
EtlFinishTime | LoadStartTime | LoadFinishTime | URL |
JobDetails
| TransactionId | ErrorTablets | User |
Comment |
++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+
+| 376416 | label_3e52da787aab4222_9126d2fce8f6d1e5 | FINISHED | Unknown id:
376416 | INSERT | NULL | cluster:N/A; timeout(s):26200;
max_filter_ratio:0.0; priority:NORMAL | NULL | 2024-02-27 01:22:17 |
2024-02-27 01:22:17 | 2024-02-27 01:22:17 | 2024-02-27 01:22:17 | 2024-02-27
01:22:18 | | {"Unfinished
backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All
backends":{},"FileNumber":0,"FileSize":0} | 9081 | {} | root
| |
+| 376664 | label_9c2bae970023407d_b2c5b78b368e78a7 | FINISHED | Unknown id:
376664 | INSERT | NULL | cluster:N/A; timeout(s):26200;
max_filter_ratio:0.0; priority:NORMAL | NULL | 2024-02-27 01:39:37 |
2024-02-27 01:39:37 | 2024-02-27 01:39:37 | 2024-02-27 01:39:37 | 2024-02-27
01:39:38 | | {"Unfinished
backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All
backends":{},"FileNumber":0,"FileSize":0} | 9084 | {} | root
| |
++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+
+```
-2. The result set is not empty
+### Cancel INSERT INTO jobs
- In the case where the result set is not empty. The returned results are
divided into the following situations:
+You can cancel the currently executing INSERT INTO job via Ctrl-C.
- 1. Insert is successful and data is visible:
+## Manual
- ```text
- mysql> insert into tbl1 select * from tbl2;
- Query OK, 4 rows affected (0.38 sec)
- {'label': 'insert_8510c568-9eda-4173-9e36-6adc7d35291c', 'status':
'visible', 'txnId': '4005'}
-
- mysql> insert into tbl1 with label my_label1 select * from tbl2;
- Query OK, 4 rows affected (0.38 sec)
- {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'}
-
- mysql> insert into tbl1 select * from tbl2;
- Query OK, 2 rows affected, 2 warnings (0.31 sec)
- {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':
'visible', 'txnId': '4005'}
-
- mysql> insert into tbl1 select * from tbl2;
- Query OK, 2 rows affected, 2 warnings (0.31 sec)
- {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':
'committed', 'txnId': '4005'}
- ```
+### Syntax
- `Query OK` indicates successful execution. `4 rows affected` means that
a total of 4 rows of data were imported. `2 warnings` indicates the number of
lines to be filtered.
+The syntax of INSERT INTO is as follows:
- Also returns a json string:
+1. INSERT INTO SELECT
- ```text
- {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'}
- {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':
'committed', 'txnId': '4005'}
- {'label': 'my_label1', 'status': 'visible', 'txnId': '4005', 'err':
'some other error'}
- ```
+INSERT INTO SELECT is used to write the query results to the target table.
- `label` is a user-specified label or an automatically generated label.
Label is the ID of this Insert Into load job. Each load job has a label that is
unique within a single database.
+```SQL
+INSERT INTO target_table SELECT ... FROM source_table;
+```
- `status` indicates whether the loaded data is visible. If visible, show
`visible`, if not, show`committed`.
+The SELECT statement above is similar to a regular SELECT query, allowing
operations such as WHERE and JOIN.
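+
+For illustration, the sketch below combines a JOIN and a WHERE filter, reusing the `testdb.test_table` and `testdb.test_table2` tables created earlier on this page:
+
+```SQL
+INSERT INTO testdb.test_table2
+SELECT t1.user_id, t1.name, t1.age
+FROM testdb.test_table t1
+JOIN testdb.test_table2 t2 ON t1.user_id = t2.user_id
+WHERE t1.age < 30;
+```
+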
- `txnId` is the id of the load transaction corresponding to this insert.
+2. INSERT INTO VALUES
- The `err` field displays some other unexpected errors.
+INSERT INTO VALUES is typically used for testing purposes. It is not
recommended for production environments.
- When user need to view the filtered rows, the user can use the following
statement
+```SQL
+INSERT INTO target_table (col1, col2, ...)
+VALUES (val1, val2, ...), (val3, val4, ...), ...;
+```
- ```text
- show load where label = "xxx";
- ```
+### Parameter configuration
- The URL in the returned result can be used to query the wrong data. For
details, see the following **View Error Lines** Summary. **"Data is not
visible" is a temporary status, this batch of data must be visible eventually**
+**FE configuration**
- You can view the visible status of this batch of data with the following
statement:
+`insert_load_default_timeout_second`
- ```text
- show transaction where id = 4005;
- ```
+- Default value: 14400s (4 hours)
+- Description: Timeout for import tasks, measured in seconds. If the import
task does not complete within this timeout period, it will be canceled by the
system and marked as CANCELLED.
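+
+As a sketch, this FE configuration can be adjusted either in `fe.conf` (taking effect after an FE restart) or at runtime with `ADMIN SET FRONTEND CONFIG` (assuming the item is runtime-mutable); the 7200-second value below is only an illustration.
+
+```SQL
+-- Illustrative value: raise the default INSERT timeout to 7200 seconds.
+ADMIN SET FRONTEND CONFIG ("insert_load_default_timeout_second" = "7200");
+-- Or persist it in fe.conf and restart the FE:
+-- insert_load_default_timeout_second = 7200
+```
+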
- If the `TransactionStatus` column in the returned result is `visible`,
the data is visible.
+**Environment parameters**
- 2. Insert fails
+`insert_timeout`
- Execution failure indicates that no data was successfully loaded, and
returns as follows:
+- Default value: 14400s (4 hours)
+- Description: Timeout for INSERT INTO as an SQL statement, measured in
seconds.
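+
+For example, the statement-level timeout for the current session can be raised as follows (the 7200-second value is illustrative):
+
+```SQL
+SET insert_timeout = 7200;
+```
+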
- ```text
- mysql> insert into tbl1 select * from tbl2 where k1 = "a";
- ERROR 1064 (HY000): all partitions have no load data. Url:
http://10.74.167.16:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de850e8de850
- ```
+`enable_insert_strict`
- Where `ERROR 1064 (HY000): all partitions have no load data` shows the
reason for the failure. The latter url can be used to query the wrong data. For
details, see the following **View Error Lines** Summary.
+- Default value: true
+- Description: If this is set to true, INSERT INTO will fail when the task
involves invalid data. If set to false, INSERT INTO will ignore invalid rows,
and the import will be considered successful as long as at least one row is
imported successfully.
+- Explanation: INSERT INTO cannot control the error rate, so this parameter is
used to either strictly check data quality or completely ignore invalid data.
Common reasons for data invalidity include: source data column length exceeding
destination column length, column type mismatch, partition mismatch, and column
order mismatch.
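+
+For example, to let the current session tolerate invalid rows instead of failing the whole import:
+
+```SQL
+SET enable_insert_strict = false;
+```
+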
-**In summary, the correct processing logic for the results returned by the
insert operation should be:**
+### Return values
-1. If the returned result is `ERROR 1064 (HY000)`, it means that the import
failed.
-2. If the returned result is `Query OK`, it means the execution was successful.
- 1. If `rows affected` is 0, the result set is empty and no data is loaded.
- 2. If`rows affected` is greater than 0:
- 1. If `status` is`committed`, the data is not yet visible. You need to
check the status through the `show transaction` statement until `visible`.
- 2. If `status` is`visible`, the data is loaded successfully.
- 3. If `warnings` is greater than 0, it means that some data is filtered.
You can get the url through the `show load` statement to see the filtered rows.
+INSERT INTO is an SQL statement, and it returns different results based on different query outcomes:
-In the previous section, we described how to follow up on the results of
insert operations. However, it is difficult to get the json string of the
returned result in some mysql libraries. Therefore, Doris also provides the
`SHOW LAST INSERT` command to explicitly retrieve the results of the last
insert operation.
+**Empty result set**
-After executing an insert operation, you can execute `SHOW LAST INSERT` on the
same session connection. This command returns the result of the most recent
insert operation, e.g.
+If the query result set of the SELECT statement in INSERT INTO is empty, the
returned value will be as follows:
-```sql
-mysql> show last insert\G
-*************************** 1. row ***************************
- TransactionId: 64067
- Label: insert_ba8f33aea9544866-8ed77e2844d0cc9b
- Database: default_cluster:db1
- Table: t1
-TransactionStatus: VISIBLE
- LoadedRows: 2
- FilteredRows: 0
+```SQL
+mysql> INSERT INTO tbl1 SELECT * FROM empty_tbl;
+Query OK, 0 rows affected (0.02 sec)
```
-This command returns the insert results and the details of the corresponding
transaction. Therefore, you can continue to execute the `show last insert`
command after each insert operation to get the insert results.
+`Query OK` indicates successful execution. `0 rows affected` means no data was
imported.
-> Note: This command will only return the results of the last insert operation
within the same session connection. If the connection is broken or replaced
with a new one, the empty set will be returned.
+**Non-empty result set and successful INSERT**
-## Relevant System Configuration
+```SQL
+mysql> INSERT INTO tbl1 SELECT * FROM tbl2;
+Query OK, 4 rows affected (0.38 sec)
+{'label':'INSERT_8510c568-9eda-4173-9e36-6adc7d35291c', 'status':'visible',
'txnId':'4005'}
-### FE configuration
+mysql> INSERT INTO tbl1 WITH LABEL my_label1 SELECT * FROM tbl2;
+Query OK, 4 rows affected (0.38 sec)
+{'label':'my_label1', 'status':'visible', 'txnId':'4005'}
-- time out
+mysql> INSERT INTO tbl1 SELECT * FROM tbl2;
+Query OK, 2 rows affected, 2 warnings (0.31 sec)
+{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'visible',
'txnId':'4005'}
- The timeout time of the import task (in seconds) will be cancelled by the
system if the import task is not completed within the set timeout time, and
will become CANCELLED.
+mysql> INSERT INTO tbl1 SELECT * FROM tbl2;
+Query OK, 2 rows affected, 2 warnings (0.31 sec)
+{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed',
'txnId':'4005'}
+```
- At present, Insert Into does not support custom import timeout time. All
Insert Into imports have a uniform timeout time. The default timeout time is 4
hours. If the imported source file cannot complete the import within the
specified time, the parameter `insert_load_default_timeout_second` of FE needs
to be adjusted.
+`Query OK` indicates successful execution. `4 rows affected` indicates that a
total of 4 rows of data have been imported. `2 warnings` indicates the number
of rows that were filtered out.
- <version since="dev"></version>
- At the same time, the Insert Into statement receives the restriction of the
Session variable `insert_timeout`. You can increase the timeout time by `SET
insert_timeout = xxx;` in seconds.
+Additionally, a JSON string is returned:
-### Session Variables
+```Plain
+{'label':'my_label1', 'status':'visible', 'txnId':'4005'}
+{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed',
'txnId':'4005'}
+{'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other
error'}
+```
-- enable_insert_strict
+Parameter description:
- The Insert Into import itself cannot control the tolerable error rate of the
import. Users can only use the Session parameter `enable_insert_strict`. When
this parameter is set to false, it indicates that at least one data has been
imported correctly, and then it returns successfully. When this parameter is
set to true, the import fails if there is a data error. The default is false.
It can be set by `SET enable_insert_strict = true;`.
+| Parameter | Description |
+| --------- | ------------------------------------------------------------ |
+| TxnId | ID of the import transaction |
+| Label | Label of the import job: can be specified using "INSERT INTO tbl
WITH LABEL label..." |
+| Status | Visibility of the imported data. `visible`: the import is successful and the data is visible. `committed`: the import is completed, but the data may be delayed in becoming visible; there is no need to retry in this case. Label Already Exists: [...]
+| Err | Error message |
-- query_timeout
+You can use the [SHOW
LOAD](https://doris.apache.org/docs/sql-manual/sql-reference/Show-Statements/SHOW-LOAD/)
statement to view the filtered rows.
- Insert Into itself is also an SQL command, and the Insert Into statement is
restricted by the Session variable <version since="dev"
type="inline">`insert_timeout`</version>. You can increase the timeout time by
`SET insert_timeout = xxx;` in seconds.
+```SQL
+SHOW LOAD WHERE label="xxx";
+```
-- enable_nereids_dml_with_pipeline
+The result of this statement will include a URL that can be used to query the
error data. For more details, refer to the "View error rows" section below.
- When set to `true`, the `insert into` statement will attempt to be executed
through the Pipeline engine. See the [import](./load-manual) documentation for
details.
+The invisible state of data is temporary, and the data will eventually become
visible.
-## Best Practices
+You can check the visibility status of a batch of data using the [SHOW
TRANSACTION](https://doris.apache.org/docs/sql-manual/sql-reference/Show-Statements/SHOW-TRANSACTION/)
statement.
-### Application scenarios
+```SQL
+SHOW TRANSACTION WHERE id=4005;
+```
-1. Users want to import only a few false data to verify the functionality of
Doris system. The grammar of INSERT INTO VALUES is suitable at this time.
-2. Users want to convert the data already in the Doris table into ETL and
import it into a new Doris table, which is suitable for using INSERT INTO
SELECT grammar.
-3. Users can create an external table, such as MySQL external table mapping a
table in MySQL system. Or create Broker external tables to map data files on
HDFS. Then the data from the external table is imported into the Doris table
for storage through the INSERT INTO SELECT grammar.
+If the `TransactionStatus` column in the result is `visible`, it indicates
that the data is visible.
-### Data volume
+
+**Non-empty result set but failed INSERT**
-Insert Into has no limitation on the amount of data, and large data imports
can also be supported. However, Insert Into has a default timeout time, and the
amount of imported data estimated by users is too large, so it is necessary to
modify the system's Insert Into import timeout time.
+Failed execution means that no data was successfully imported. An error
message will be returned:
-```text
-Import data volume = 36G or less than 3600s*10M/s
-Among them, 10M/s is the maximum import speed limit. Users need to calculate
the average import speed according to the current cluster situation to replace
10M/s in the formula.
+```SQL
+mysql> INSERT INTO tbl1 SELECT * FROM tbl2 WHERE k1 = "a";
+ERROR 1064 (HY000): all partitions have no load data. url:
http://10.74.167.16:8042/api/_load_error_log?file=_shard_2/error_loginsert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de8507c0bf8a2
```
-### Complete examples
+`ERROR 1064 (HY000): all partitions have no load data` indicates the root
cause for the failure. The URL provided in the error message can be used to
locate the error data. For more details, refer to the "View error rows" section
below.
+
+## Best practice
+
+### Data size
+
+INSERT INTO imposes no limitations on data volume and can support large-scale data imports. However, if you are importing a large amount of data, it is recommended to adjust the system's INSERT INTO timeout settings to ensure that `import timeout >= data volume / estimated import speed`.
+
+1. FE configuration parameter `insert_load_default_timeout_second`
+2. Environment parameter `insert_timeout`
+
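+As a rough worked example (the 10 GB data size and 5 MB/s import speed are assumptions; measure the actual speed of your cluster): importing 10 GB at about 5 MB/s takes roughly 2000 seconds, so the timeout should be set comfortably above that, for instance:
+
+```SQL
+-- 10 GB / 5 MB/s ≈ 2000 s, so a 3600 s timeout leaves some headroom.
+SET insert_timeout = 3600;
+```
+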
+### View error rows
-Users have a table store sales in the database sales. Users create a table
called bj store sales in the database sales. Users want to import the data
recorded in the store sales into the new table bj store sales. The amount of
data imported is about 10G.
+When the INSERT INTO result includes a URL field, you can use the following
command to view the error rows:
-```text
-large sales scheme
-(id, total, user_id, sale_timestamp, region)
+```SQL
+SHOW LOAD WARNINGS ON "url";
+```
+
+Example:
-Order large sales schedule:
-(id, total, user_id, sale_timestamp)
+```SQL
+SHOW LOAD WARNINGS ON
"http://ip:port/api/_load_error_log?file=_shard_13/error_loginsert_stmt_d2cac0a0a16d482d-9041c949a4b71605_d2cac0a0a16d482d_9041c949a4b71605";
```
-Cluster situation: The average import speed of current user cluster is about
5M/s
+Common reasons for errors include: source data column length exceeding
destination column length, column type mismatch, partition mismatch, and column
order mismatch.
+
+You can control whether INSERT INTO ignores error rows by configuring the
environment variable `enable_insert_strict`.
+
+## Ingest external data via Multi-Catalog
-- Step1: Determine whether you want to modify the default timeout of Insert
Into
+Doris supports the creation of external tables. Once created, data from
external tables can be imported into Doris internal tables using `INSERT INTO
SELECT`, or queried directly using SELECT statements.
- ```text
- Calculate the approximate time of import
- 10G / 5M /s = 2000s
-
- Modify FE configuration
- insert_load_default_timeout_second = 2000
- ```
+With its Multi-Catalog feature, Doris supports connections to various
mainstream data lakes and databases including Apache Hive, Apache Iceberg,
Apache Hudi, Apache Paimon (Incubating), Elasticsearch, MySQL, Oracle, and SQL
Server.
-- Step2: Create Import Tasks
+For more information on Multi-Catalog, please refer to [Lakehouse
overview](https://doris.apache.org/docs/2.0/lakehouse/lakehouse-overview/#multi-catalog).
- Since users want to ETL data from a table and import it into the target
table, they should use the Insert in query\stmt mode to import it.
+The following illustrates importing data from a Hive external table into a Doris internal table.
- ```text
- INSERT INTO bj_store_sales SELECT id, total, user_id, sale_timestamp FROM
store_sales where region = "bj";
- ```
+### Create Hive Catalog
-## Common Questions
+```SQL
+CREATE CATALOG hive PROPERTIES (
+ 'type'='hms',
+ 'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
+ 'hadoop.username' = 'hive',
+ 'dfs.nameservices'='your-nameservice',
+ 'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
+ 'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
+ 'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
+
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
+);
+```
+
+### Ingest data
+
+1. Create a target table for the data import in Doris.
+
+```SQL
+CREATE TABLE `target_tbl` (
+ `k1` decimal(9, 3) NOT NULL COMMENT "",
+ `k2` char(10) NOT NULL COMMENT "",
+ `k3` datetime NOT NULL COMMENT "",
+ `k5` varchar(20) NOT NULL COMMENT "",
+ `k6` double NOT NULL COMMENT ""
+)
+COMMENT "Doris Table"
+DISTRIBUTED BY HASH(k1) BUCKETS 2
+PROPERTIES (
+ "replication_num" = "1"
+);
+```
-- View the wrong line
+2. For detailed instructions on creating Doris tables, please refer to [CREATE
TABLE](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE/).
+
+3. Import data from the `hive.db1.source_tbl` table into the `target_tbl` table.
+
+```SQL
+INSERT INTO target_tbl SELECT k1,k2,k3 FROM hive.db1.source_tbl limit 100;
+```
+
+The INSERT command is a synchronous command. If it returns a result, that
indicates successful import.
+
+### Notes
+
+- Ensure that the external data source and the Doris cluster can communicate,
including mutual network accessibility between BE nodes and external data
sources.
+
+## Ingest data by TVF
+
+Doris can directly query and analyze files stored in object storage or HDFS as
tables through the Table Value Functions (TVFs), which supports automatic
column type inference. For detailed information, please refer to the
Lakehouse/TVF documentation.
+
+### Automatic column type inference
+
+```Plain
+DESC FUNCTION s3 (
+ "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
+ "s3.access_key"= "ak",
+ "s3.secret_key" = "sk",
+ "format" = "parquet",
+ "use_path_style"="true"
+);
++---------------+--------------+------+-------+---------+-------+
+| Field | Type | Null | Key | Default | Extra |
++---------------+--------------+------+-------+---------+-------+
+| p_partkey | INT | Yes | false | NULL | NONE |
+| p_name | TEXT | Yes | false | NULL | NONE |
+| p_mfgr | TEXT | Yes | false | NULL | NONE |
+| p_brand | TEXT | Yes | false | NULL | NONE |
+| p_type | TEXT | Yes | false | NULL | NONE |
+| p_size | INT | Yes | false | NULL | NONE |
+| p_container | TEXT | Yes | false | NULL | NONE |
+| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
+| p_comment | TEXT | Yes | false | NULL | NONE |
++---------------+--------------+------+-------+---------+-------+
+```
+
+In this example of an S3 TVF, the file path, connection information, and
authentication information are specified.
+
+You can use the `DESC FUNCTION` syntax to view the schema of this file.
+
+It can be seen that for Parquet files, Doris automatically infers column types
based on the metadata within the file.
+
+Currently, Doris supports analysis and column type inference for Parquet, ORC,
CSV, and JSON formats.
+
+It can be used in combination with the `INSERT INTO SELECT` syntax to quickly
import files into Doris tables for faster analysis.
+
+```Plain
+// 1. Create Doris internal table
+CREATE TABLE IF NOT EXISTS test_table
+(
+ id int,
+ name varchar(50),
+ age int
+)
+DISTRIBUTED BY HASH(id) BUCKETS 4
+PROPERTIES("replication_num" = "1");
+
+// 2. Insert data by S3 Table Value Function
+INSERT INTO test_table (id,name,age)
+SELECT cast(id as INT) as id, name, cast (age as INT) as age
+FROM s3(
+ "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
+ "s3.access_key"= "ak",
+ "s3.secret_key" = "sk",
+ "format" = "parquet",
+ "use_path_style" = "true");
+```
- Because Insert Into can't control the error rate, it can only tolerate or
ignore the error data completely by `enable_insert_strict`. So if
`enable_insert_strict` is set to true, Insert Into may fail. If
`enable_insert_strict` is set to false, then only some qualified data may be
imported. However, in either case, Doris is currently unable to provide the
ability to view substandard data rows. Therefore, the user cannot view the
specific import error through the Insert Into statement.
+### Notes
- The causes of errors are usually: source data column length exceeds
destination data column length, column type mismatch, partition mismatch,
column order mismatch, etc. When it's still impossible to check for problems.
At present, it is only recommended that the SELECT command in the Insert Into
statement be run to export the data to a file, and then import the file through
Stream load to see the specific errors.
+- If the URI specified in the `S3 / hdfs` TVF does not match any files, or if
all matched files are empty, the `S3 / hdfs` TVF will return an empty result
set. In such cases, if you use `DESC FUNCTION` to view the schema of the file,
you will see a dummy column `__dummy_col`, which can be ignored.
+- If the format specified for the TVF is CSV and the file being read is not
empty but the first line of the file is empty, an error will be prompted: `The
first line is empty, can not parse column numbers`. This is because the schema
cannot be parsed from the first line of the file.
-## more help
+## More help
-For more detailed syntax and best practices used by insert into, see
[insert](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT.md)
command manual, you can also enter `HELP INSERT` in the MySql client command
line for more help information.
+For more detailed syntax on INSERT INTO, refer to the [INSERT
INTO](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT/)
command manual. You can also type `HELP INSERT` at the MySQL client command
line for further information.
diff --git
a/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md
b/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md
index 06592878c895..a43a1640f417 100644
--- a/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md
+++ b/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md
@@ -24,129 +24,246 @@ specific language governing permissions and limitations
under the License.
-->
-# Mysql Load
-<version since="dev">
+Apache Doris is compatible with the MySQL protocol and supports the use of the
standard MySQL [LOAD
DATA](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) syntax to import
local files. MySQL Load is a synchronous import method where the import result
is returned upon completion. That means users can tell whether the import
succeeds from the returned result. Generally, the MySQL Load method can be used
to ingest files smaller than 10GB in size. For files larger than 10GB, it is
reco [...]
-This is an stand syntax of MySql [LOAD
DATA](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) for user to load
local file.
+## Applicable scenarios
-MySql load synchronously executes the import and returns the import result.
The return information will show whether the import is successful for user.
+**Supported format**
-MySql load is mainly suitable for importing local files on the client side, or
importing data from a data stream through a program.
+MySQL Load is primarily designed for importing CSV files from the client's
local machine or importing data from data streams through programs.
-</version>
+**Restrictions**
-## Basic Principles
+When importing CSV files, it is important to differentiate between null values
and empty strings (''):
-The MySql Load are similar with Stream Load. Both import local files into the
Doris cluster, so the MySQL Load will reuses StreamLoad:
+- Null values are represented by the escape sequence \N. For example, in
`a,\N,b`, the middle column represents a null value.
+- Empty strings are represented directly as empty, such as `a, ,b`, where the
middle column represents an empty string.
- 1. FE receives the MySQL Load request executed by the client and then analyse
the SQL
+## Implementation
- 2. FE build the MySql Load request as a StreamLoad request.
+MySQL Load is similar to Stream Load in terms of functionality. They both
involve importing local files into the Doris cluster. Therefore, the
implementation of MySQL Load reuses the basic import capabilities of Stream
Load.
- 3. FE selects a BE node to send a StreamLoad request
+The main processes of MySQL Load include:
- 4. When sending the request, FE will read the local file data from the MySQL
client side streamingly, and send it to the HTTP request of StreamLoad
asynchronously.
+1. The user submits a LOAD DATA request to the frontend (FE), which performs
the parsing and encapsulates the request into a Stream Load task.
+2. The FE selects a backend (BE) node and sends the Stream Load request to it.
+3. Meanwhile, the FE reads the local file data from the MySQL client in an
asynchronous and streaming manner and sends it in real time to the HTTP request
of the Stream Load.
+4. Once the data transfer from the MySQL client is complete, the FE waits for
the Stream Load to finish and displays the import result (success or failure)
to the client.
- 5. After the data transfer on the MySQL client side is completed, FE waits
for the StreamLoad to complete, and displays the import success or failure
information to the client side.
+## Get started
+### Preparations
-## Support data format
+MySQL Load requires INSERT permission on the target table. You can grant
permissions to user accounts using the GRANT command.
-MySql Load currently only supports data formats: CSV (text).
+### Create a MySQL Load job
-## Basic operations
+1. Prepare the test data
-### client connection
-```bash
-mysql --local-infile -h 127.0.0.1 -P 9030 -u root -D testdb
+Create a data file `client_local.csv` containing the following sample data:
+
+```Plain
+1,10
+2,20
+3,30
+4,40
+5,50
+6,60
```
-Notice that if you wants to use mysql load, you must connect doris server with
`--local-infile` in client command.
-If you're use jdbc to connect doris, you must add property named
`allowLoadLocalInfile=true` in jdbc url.
+2. Connect to MySQL client
+Connect to the MySQL client before executing the LOAD DATA command:
-### Create test table
-```sql
- CREATE TABLE testdb.t1 (pk INT, v1 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY
hash (pk) PROPERTIES ('replication_num' = '1');
- ```
- ### import file from client node
- Suppose there is a CSV file named 'client_local.csv 'on the current path of
the client side, which will be imported into the test table'testdb.t1' using
the MySQL LOAD syntax.
+```Shell
+mysql --local-infile -h <fe_ip> -P <fe_query_port> -u root -D testdb
+```
-```sql
-LOAD DATA LOCAL
-INFILE 'client_local.csv '
-INTO TABLE testdb.t1
-PARTITION (partition_a, partition_b, partition_c, partition_d)
-COLUMNS TERMINATED BY '\ t'
-LINES TERMINATED BY '\ n'
-IGNORE 1 LINES
-(K1, k2, v2, v10, v11)
-SET (c1 = k1, c2 = k2, c3 = v10, c4 = v11)
-PROPERTIES ("strict_mode" = "true")
+:::caution
+Specific parameter options need to be used during the connection:
+
+1. When connecting to the MySQL client, `--local-infile` must be included,
otherwise an error might be thrown.
+2. When connecting via JDBC, `allowLoadLocalInfile=true` must be specified in
the URL.
+:::
+
+3. Create the test table
+
+Create a table as follows in Doris:
+
+```SQL
+CREATE TABLE testdb.t1 (
+ pk INT,
+ v1 INT SUM
+) AGGREGATE KEY (pk)
+DISTRIBUTED BY hash (pk);
```
-1. MySQL Load starts with the syntax `LOAD DATA`, and specifying `LOCAL` means
reading client side files.
-2. The local fill path will be filled after `INFILE`, which can be a relative
path or an absolute path. Currently only a single file is supported, and
multiple files are not supported
-3. The table name after `INTO TABLE` can specify the database name, as shown
in the case. It can also be omitted, and the current database for the user will
be used.
-4. `PARTITION` syntax supports specified partition to import
-5. `COLUMNS TERMINATED BY` specifies the column separator
-6. `LINES TERMINATED BY` specifies the line separator
-7. `IGNORE num LINES` skips the num header of the CSV.
-8. Column mapping syntax, see the column mapping chapter of [Imported Data
Transformation](../import-scenes/load-data-convert.md) for specific parameters
-9. `PROPERTIES` is the configuration of import, please refer to the [MySQL
Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD.md)
command manual for specific properties.
-
-### import file from fe server node
-Assuming that the '/root/server_local.csv' path on the FE node is a CSV file,
use the MySQL client side to connect to the corresponding FE node, and then
execute the following command to import data into the test table.
-
-```sql
-LOAD DATA
-INFILE '/root/server_local.csv'
+
+4. Run the LOAD DATA command
+
+After connecting to the MySQL client, create a Load job. The command is as
follows:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'client_local.csv'
INTO TABLE testdb.t1
-PARTITION (partition_a, partition_b, partition_c, partition_d)
-COLUMNS TERMINATED BY '\ t'
-LINES TERMINATED BY '\ n'
-IGNORE 1 LINES
-(K1, k2, v2, v10, v11)
-SET (c1 = k1, c2 = k2, c3 = v10, c4 = v11)
-PROPERTIES ("strict_mode" = "true")
+COLUMNS TERMINATED BY ','
+LINES TERMINATED BY '\n';
```
-1. The only difference between the syntax of importing server level local
files and importing client side syntax is whether the'LOCAL 'keyword is added
after the'LOAD DATA' keyword.
-2. FE will have multi-nodes, and importing server level files can only import
FE nodes connected by the client side, and cannot import files local to other
FE nodes.
-3. Server side load was disabled by default. Enable it by setting
`mysql_load_server_secure_path` with a secure path. All the load file should be
under this path.
-### Return result
-Since MySQL load is a synchronous import method, the imported results are
returned to the user through SQL syntax.
-If the import fails, a specific error message will be displayed. If the import
is successful, the number of imported rows will be displayed.
+### View results
-```Text
-Query OK, 1 row affected (0.17 sec)
-Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
+MySQL Load is a synchronous import method, and the results of the import are
returned to the user in the command-line interface. If the import execution
fails, it will display specific error messages.
+
+Below is an example of a successful import result, which returns the number of
imported rows:
+
+```SQL
+Query OK, 6 rows affected (0.17 sec)
+Records: 6 Deleted: 0 Skipped: 0 Warnings: 0
```
-### Error result
-If mysql load process goes wrong, it will show the error in the client as
below:
-```text
+When there are exceptions during the import, the corresponding error will be
displayed on the client:
+
+```SQL
ERROR 1105 (HY000): errCode = 2, detailMessage = [DATA_QUALITY_ERROR]too many
filtered rows with load id b612907c-ccf4-4ac2-82fe-107ece655f0f
```
-If you meets this error, you can extract the `loadId` and use it in the `show
load warnings` command to get more detail message.
-```sql
+The `loadId` is included in the error message, based on which you can view the
detailed information via the `show load warnings` command:
+
+```SQL
show load warnings where label='b612907c-ccf4-4ac2-82fe-107ece655f0f';
```
-The loadId was the label in this case.
+### Cancel a MySQL Load job
+
+Doris does not allow manual cancellation of MySQL Load jobs. In the event of a
timeout or import error, the corresponding MySQL Load job will be automatically
cancelled by the system.
+
+## Manual
+
+### Syntax
+
+The syntax for LOAD DATA is as follows:
+
+```SQL
+LOAD DATA LOCAL
+INFILE '<load_data_file>'
+INTO TABLE [<db_name>.]<table_name>
+[PARTITION (partition_name [, partition_name] ...)]
+[COLUMNS TERMINATED BY '<column_terminated_operator>']
+[LINES TERMINATED BY '<line_terminated_operator>']
+[IGNORE <ignore_lines> LINES]
+[(col_name_or_user_var[, col_name_or_user_var] ...)]
+[SET col_name={expr | DEFAULT}[, col_name={expr | DEFAULT}] ...]
+[PROPERTIES (key1 = value1 [, key2=value2]) ]
+```
+Descriptions of the clauses in the LOAD DATA statement:
-### Configuration
-1. `mysql_load_thread_pool`: the thread pool size for singe FE node, set 4
thread by default. The block queue size is 5 times of `mysql_load_thread_pool`.
So FE can accept 4 + 4\*5 = 24 requests in one time. Increase this
configuration if the parallelism are larger than 24.
-2. `mysql_load_server_secure_path`: the secure path for load data from server.
Empty path by default means that it's not allowed for server load. Recommend to
create a `local_import_data` directory under `DORIS_HOME` to load data if you
want enable it.
-3. `mysql_load_in_memory_record` The failed mysql load record size. The record
was keep in memory and only have 20 records by default. If you want to track
more records, you can rise the config but be careful about the fe memory. This
record will expired after one day and there is a async thread to clean it in
every day.
+| Clause | Description |
+| --------------------- | ------------------------------------------------------------ |
+| INFILE | This specifies the local file path, which can be either a relative path or an absolute path. Currently, `load_data_file` only supports a single file. |
+| INTO TABLE | This specifies the database and table, and the
database name can be omitted. |
+| PARTITION | This specifies the target partition. If the user can
determine the partition corresponding to the data, it is recommended to specify
this. Data that does not fit into the specified partitions will be filtered
out. |
+| COLUMNS TERMINATED BY | This specifies the column delimiter.
|
+| LINES TERMINATED BY | This specifies the row delimiter. |
+| IGNORE num LINES | This specifies the number of rows to skip in the CSV
import, typically specified as 1 to skip the header. |
+| col_name_or_user_var | This specifies the column mapping syntax. For more
information, refer to [Column
Mapping](https://doris.apache.org/docs/2.0/data-operate/import/load-data-convert#column-mapping).
|
+| PROPERTIES | Parameters for the Load.
|
+### Parameters
-## Notice
+You can configure the parameters for the Load using the `PROPERTIES (key1 = value1 [, key2=value2])` syntax.
+
+| Parameter | Description
|
+| ------------------ |
------------------------------------------------------------ |
+| max_filter_ratio | The maximum filtering rate allowed. Must be between 0
and 1, inclusive. The default value is 0, indicating no tolerance for any error
rows. |
+| timeout | The import timeout, measured in seconds. The default
value is 600. The range allowed is from 1s to 259200s. |
+| strict_mode | Whether to enable strict mode for this import. Disabled
by default. |
+| timezone | The time zone for this import. The default time zone is
UTC+8. This parameter will affect the results of any time zone-related
functions involved in the import. |
+| exec_mem_limit | Memory limit for the import, defaults to 2GB, measured
in bytes. |
+| trim_double_quotes | Boolean, defaults to false. If this is set to true, the
outermost double quotes will be trimmed from each field in the import file. |
+| enclose | This specifies the enclosure character. When a CSV data field contains line breaks or column delimiters, you can specify a single-byte character as the enclosure character to prevent accidental truncation. For example, if the column delimiter is ",", and the enclosure character is "'", in data "a,'b,c'", "b,c" will be parsed as one field. |
+| escape | This specifies the escape character. This is used when the data contains the same character as the enclosure character, which needs to be treated as part of the field. For example, if the data is "a,'b,'c'", the enclosure character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", to modify the data to "a,'b,\'c'". |
+
+## Example
+
+### Specify load timeout
+
+You can adjust the import timeout by specifying `timeout` in PROPERTIES. For
example, set it to 100s:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+PROPERTIES ("timeout"="100");
+```
-1. If you see this `LOAD DATA LOCAL INFILE file request rejected due to
restrictions on access` message, you should connet mysql with `mysql
--local-infile=1` command to enable client to load local file.
-2. The configuration for StreamLoad will also affect MySQL Load. Such as the
configurate in be named `streaming_load_max_mb`, it's 10GB by default and it
will control the max size for one load.
+### Specify allowable error rate
+
+You can adjust the allowable error rate by specifying `max_filter_ratio` in
PROPERTIES. For example, set it to 20%:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+PROPERTIES ("max_filter_ratio"="0.2");
+```
+
+### Import column mapping
+
+The following example adjusts the order of columns in the CSV file.
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+(k2, k1, v1);
+```
+
+### Specify column and row delimiters
+
+You can specify the column and row delimiters using the `COLUMNS TERMINATED
BY` and `LINES TERMINATED BY` clauses. In the following example, (,) and (\n)
are used as the column and row delimiters, respectively.
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+COLUMNS TERMINATED BY ','
+LINES TERMINATED BY '\n';
+```
+
+### Specify target partition
+
+You can specify the target partition for the import using the `PARTITION`
clause. In the following example, data will be loaded into the specified
partitions 'p1' and 'p2', and any data that does not belong to these two
partitions will be filtered out:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+PARTITION (p1, p2);
+```
+
+### Specify time zone
+
+You can specify the `timezone` in PROPERTIES. In the following example, the
timezone is set to Africa/Abidjan:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+PROPERTIES ("timezone"="Africa/Abidjan");
+```
+
+### Specify the memory limit for the import
+
+You can specify the memory limit for the import by the `exec_mem_limit`
parameter in PROPERTIES. In the following example, the memory limit is set to
10G:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+PROPERTIES ("exec_mem_limit"="10737418240");
+```
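+
+### Specify the enclosure character
+
+The following is a sketch only, assuming single quotes are used to enclose fields that contain the column delimiter, as described by the `enclose` parameter above:
+
+```SQL
+LOAD DATA LOCAL
+INFILE 'testData'
+INTO TABLE testDb.testTbl
+COLUMNS TERMINATED BY ','
+PROPERTIES ("enclose"="'");
+```
+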
-## More Help
+## More help
-1. For more detailed syntax and best practices for using MySQL Load, see the
[MySQL
Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD.md)
command manual.
+For more detailed syntax and best practices related to MySQL Load, refer to
the [MySQL
Load](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD/)
command manual.
diff --git a/versioned_docs/version-2.0/table-design/auto-increment.md
b/versioned_docs/version-2.0/table-design/auto-increment.md
index 9cc51cae8c04..e8031ce39d71 100644
--- a/versioned_docs/version-2.0/table-design/auto-increment.md
+++ b/versioned_docs/version-2.0/table-design/auto-increment.md
@@ -24,126 +24,66 @@ specific language governing permissions and limitations
under the License.
-->
-# AUTO_INCREMENT Column
+:::tip
+Auto-increment column is supported in Doris since version 2.1. The
documentation of V2.0 only provides an introduction to its functionality and
use cases. If you are interested in this feature, please upgrade to Doris 2.1
and refer to the relevant part in the documentation of V2.1.
+:::
-<version since="2.1">
+For tables with an AUTO_INCREMENT column, Doris assigns a table-unique value to rows that do not have a specified value in the auto-increment column.
-</version>
+## Description
-When importing data, Doris assigns a table-unique value to rows that do not
have specified values in the auto-increment column.
+For tables that contain an auto-increment column, during data ingestion:
-## Functionality
-
-For tables containing an auto-increment column, during data import:
-- If the target columns don't include the auto-increment column, Doris will
populate the auto-increment column with generated values.
-- If the target columns include the auto-increment column, null values in the
imported data for that column will be replaced by values generated by Doris,
while non-null values will remain unchanged. Note that **non-null values can
disrupt the uniqueness of the auto-increment column values**.
+- If the auto-increment column is **NOT** one of the target columns of this
import, it will be auto-filled with values generated by Doris. The auto-filled
values are unique for each row.
+- If the auto-increment column is one of the target columns of this import,
the imported null values in this column will be replaced with values generated
by Doris, while non-null values will remain unchanged. Note that the existence
of the imported non-null values might compromise the uniqueness of values in
the auto-increment columns.
### Uniqueness
-Doris ensures that values generated on the auto-increment column have
**table-wide uniqueness**. However, it's important to note that **the
uniqueness of the auto-increment column only guarantees uniqueness for values
automatically filled by Doris and does not consider values provided by users**.
If a user explicitly inserts user-provided values for this table by specifying
the auto-increment column, this uniqueness cannot be guaranteed.
-
-### Density
-
-Doris ensures that the values generated on the auto-increment column are
dense, but it **cannot guarantee** that the values automatically generated in
the auto-increment column during an import will be entirely contiguous. Thus,
there might be some jumps in the values generated by the auto-increment column
during an import. This is because, for performance consideration, each BE
caches a portion of pre-allocated auto-increment column values, and these
cached values do not intersect betwe [...]
-
-## Syntax
+Doris ensures that the values generated on the auto-increment column are
unique within the table. However, Doris can only guarantee the uniqueness of
values generated by itself. If the user inserts values into the auto-increment
column explicitly, it does not guarantee the table-wide uniqueness of such
values.
-To use auto-increment columns, you need to add the `AUTO_INCREMENT` attribute
to the corresponding column during table creation
([CREATE-TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE)).
To manually specify the starting value for an auto-increment column, you can
do so by using the `AUTO_INCREMENT(start_value)` statement when creating the
table. If not specified, the default starting value is 1.
+### Density
-### Examples
+Doris ensures that the values generated on the auto-increment column are
dense. However, it does not guarantee that the values populated by
auto-increment during a single import will be completely continuous. This is
because, for performance reasons, each backend (BE) caches a portion of
pre-allocated auto-increment values, and the cached values do not overlap
between different BEs. Additionally, due to the existence of caching, Doris
cannot guarantee that the auto-increment values gener [...]
-1. Creating a Duplicate table with one key column as an auto-increment column:
-
- ```sql
- CREATE TABLE `demo`.`tbl` (
- `id` BIGINT NOT NULL AUTO_INCREMENT,
- `value` BIGINT NOT NULL
- ) ENGINE=OLAP
- DUPLICATE KEY(`id`)
- DISTRIBUTED BY HASH(`id`) BUCKETS 10
- PROPERTIES (
- "replication_allocation" = "tag.location.default: 3"
- );
+## Usage
-2. Creating a Duplicate table with one key column as an auto-increment column,
and set start value is 100:
+1. Create a table in the Duplicate Key model, with one of the key columns
being an auto-increment column.
- ```sql
- CREATE TABLE `demo`.`tbl` (
- `id` BIGINT NOT NULL AUTO_INCREMENT(100),
- `value` BIGINT NOT NULL
- ) ENGINE=OLAP
- DUPLICATE KEY(`id`)
- DISTRIBUTED BY HASH(`id`) BUCKETS 10
- PROPERTIES (
- "replication_allocation" = "tag.location.default: 3"
- );
- ```
+```SQL
+CREATE TABLE `demo`.`tbl` (
+ `id` BIGINT NOT NULL AUTO_INCREMENT,
+ `value` BIGINT NOT NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 10
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 3"
+);
+```
-3. Creating a Duplicate table with one value column as an auto-increment
column:
+2. Create a table in the Duplicate Key model, with one of the value columns
being an auto-increment column.
- ```sql
- CREATE TABLE `demo`.`tbl` (
+```SQL
+CREATE TABLE `demo`.`tbl` (
`uid` BIGINT NOT NULL,
`name` BIGINT NOT NULL,
`id` BIGINT NOT NULL AUTO_INCREMENT,
`value` BIGINT NOT NULL
- ) ENGINE=OLAP
- DUPLICATE KEY(`uid`, `name`)
- DISTRIBUTED BY HASH(`uid`) BUCKETS 10
- PROPERTIES (
- "replication_allocation" = "tag.location.default: 3"
- );
- ```
-
-4. Creating a Unique tbl table with one key column as an auto-increment column:
-
- ```sql
- CREATE TABLE `demo`.`tbl` (
- `id` BIGINT NOT NULL AUTO_INCREMENT,
- `name` varchar(65533) NOT NULL,
- `value` int(11) NOT NULL
- ) ENGINE=OLAP
- UNIQUE KEY(`id`)
- DISTRIBUTED BY HASH(`id`) BUCKETS 10
- PROPERTIES (
- "replication_allocation" = "tag.location.default: 3",
- "enable_unique_key_merge_on_write" = "true"
- );
- ```
-
-5. Creating a Unique tbl table with one value column as an auto-increment
column:
-
- ```sql
- CREATE TABLE `demo`.`tbl` (
- `text` varchar(65533) NOT NULL,
- `id` BIGINT NOT NULL AUTO_INCREMENT,
- ) ENGINE=OLAP
- UNIQUE KEY(`text`)
- DISTRIBUTED BY HASH(`text`) BUCKETS 10
- PROPERTIES (
- "replication_allocation" = "tag.location.default: 3",
- "enable_unique_key_merge_on_write" = "true"
- );
- ```
-
-### Constraints and Limitations
-
-- Only Duplicate model tables and Unique model tables can contain
auto-increment columns.
-- A table can contain at most one auto-increment column.
-- The type of the auto-increment column must be BIGINT and must be NOT NULL.
-- The manually specified starting value for an auto-increment column must be
greater than or equal to 0.
-
-## Usage
-
-### Import
+) ENGINE=OLAP
+DUPLICATE KEY(`uid`, `name`)
+DISTRIBUTED BY HASH(`uid`) BUCKETS 10
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 3"
+);
+```
-Consider the following table:
+3. Create a table in the Unique Key model, with one of the key columns being
an auto-increment column.
-```sql
+```SQL
CREATE TABLE `demo`.`tbl` (
- `id` BIGINT NOT NULL AUTO_INCREMENT,
- `name` varchar(65533) NOT NULL,
- `value` int(11) NOT NULL
+ `id` BIGINT NOT NULL AUTO_INCREMENT,
+ `name` varchar(65533) NOT NULL,
+ `value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
@@ -153,203 +93,30 @@ PROPERTIES (
);
```
-When using the insert into statement to import data without specifying the
auto-increment column `id`, the `id` column will automatically be filled with
generated values.
-
-```sql
-mysql> insert into tbl(name, value) values("Bob", 10), ("Alice", 20), ("Jack",
30);
-Query OK, 3 rows affected (0.09 sec)
-{'label':'label_183babcb84ad4023_a2d6266ab73fb5aa', 'status':'VISIBLE',
'txnId':'7'}
-
-mysql> select * from tbl order by id;
-+------+-------+-------+
-| id | name | value |
-+------+-------+-------+
-| 1 | Bob | 10 |
-| 2 | Alice | 20 |
-| 3 | Jack | 30 |
-+------+-------+-------+
-3 rows in set (0.05 sec)
-```
-
-Similarly, using stream load to import the file test.csv without specifying
the auto-increment column `id` will result in the id column being automatically
filled with generated values.
+4. Create a table in the Unique Key model, with one of the value columns being
an auto-increment column.
-test.csv:
-```
-Tom, 40
-John, 50
-```
-
-```
-curl --location-trusted -u user:passwd -H "columns:name,value" -H
"column_separator:," -T ./test1.csv
http://{host}:{port}/api/{db}/tbl/_stream_load
-```
-
-```sql
-mysql> select * from tbl order by id;
-+------+-------+-------+
-| id | name | value |
-+------+-------+-------+
-| 1 | Bob | 10 |
-| 2 | Alice | 20 |
-| 3 | Jack | 30 |
-| 4 | Tom | 40 |
-| 5 | John | 50 |
-+------+-------+-------+
-5 rows in set (0.04 sec)
-```
-
-When importing using insert into statement while specifying the auto-increment
column `id`, null values in the imported data for that column will be replaced
by generated values.
-
-```sql
-mysql> insert into tbl(id, name, value) values(null, "Doris", 60), (null,
"Nereids", 70);
-Query OK, 2 rows affected (0.07 sec)
-{'label':'label_9cb0c01db1a0402c_a2b8b44c11ce4703', 'status':'VISIBLE',
'txnId':'10'}
-
-mysql> select * from tbl order by id;
-+------+---------+-------+
-| id | name | value |
-+------+---------+-------+
-| 1 | Bob | 10 |
-| 2 | Alice | 20 |
-| 3 | Jack | 30 |
-| 4 | Tom | 40 |
-| 5 | John | 50 |
-| 6 | Doris | 60 |
-| 7 | Nereids | 70 |
-+------+---------+-------+
-7 rows in set (0.04 sec)
-```
-
-### Partial Update
-
-When performing a partial update on a merge-on-write Unique table containing
an auto-increment column:
-
-If the auto-increment column is a key column, during partial updates, as users
must explicitly specify the key column, the target columns for partial column
updates must include the auto-increment column. In this scenario, the import
behavior is similar to regular partial updates.
-
-```sql
-mysql> CREATE TABLE `demo`.`tbl2` (
- -> `id` BIGINT NOT NULL AUTO_INCREMENT,
- -> `name` varchar(65533) NOT NULL,
- -> `value` int(11) NOT NULL DEFAULT "0"
- -> ) ENGINE=OLAP
- -> UNIQUE KEY(`id`)
- -> DISTRIBUTED BY HASH(`id`) BUCKETS 10
- -> PROPERTIES (
- -> "replication_allocation" = "tag.location.default: 3",
- -> "enable_unique_key_merge_on_write" = "true"
- -> );
-Query OK, 0 rows affected (0.03 sec)
-
-mysql> insert into tbl2(id, name, value) values(1, "Bob", 10), (2, "Alice",
20), (3, "Jack", 30);
-Query OK, 3 rows affected (0.14 sec)
-{'label':'label_5538549c866240b6_bce75ef323ac22a0', 'status':'VISIBLE',
'txnId':'1004'}
-
-mysql> select * from tbl2 order by id;
-+------+-------+-------+
-| id | name | value |
-+------+-------+-------+
-| 1 | Bob | 10 |
-| 2 | Alice | 20 |
-| 3 | Jack | 30 |
-+------+-------+-------+
-3 rows in set (0.08 sec)
-
-mysql> set enable_unique_key_partial_update=true;
-Query OK, 0 rows affected (0.01 sec)
-
-mysql> set enable_insert_strict=false;
-Query OK, 0 rows affected (0.00 sec)
-
-mysql> insert into tbl2(id, name) values(1, "modified"), (4, "added");
-Query OK, 2 rows affected (0.06 sec)
-{'label':'label_3e68324cfd87457d_a6166cc0a878cfdc', 'status':'VISIBLE',
'txnId':'1005'}
-
-mysql> select * from tbl2 order by id;
-+------+----------+-------+
-| id | name | value |
-+------+----------+-------+
-| 1 | modified | 10 |
-| 2 | Alice | 20 |
-| 3 | Jack | 30 |
-| 4 | added | 0 |
-+------+----------+-------+
-4 rows in set (0.04 sec)
-```
-
-When the auto-increment column is a non-key column and users haven't specified
the value for the auto-increment column, the value will be filled from existing
data rows in the table. If users specify the auto-increment column, null values
in the imported data for that column will be replaced by generated values,
while non-null values will remain unchanged, and then these data will be loaded
with the semantics of partial updates.
-
-```sql
-mysql> CREATE TABLE `demo`.`tbl3` (
- -> `id` BIGINT NOT NULL,
- -> `name` varchar(100) NOT NULL,
- -> `score` BIGINT NOT NULL,
- -> `aid` BIGINT NOT NULL AUTO_INCREMENT
- -> ) ENGINE=OLAP
- -> UNIQUE KEY(`id`)
- -> DISTRIBUTED BY HASH(`id`) BUCKETS 1
- -> PROPERTIES (
- -> "replication_allocation" = "tag.location.default: 3",
- -> "enable_unique_key_merge_on_write" = "true"
- -> );
-Query OK, 0 rows affected (0.16 sec)
-
-mysql> insert into tbl3(id, name, score) values(1, "Doris", 100), (2,
"Nereids", 200), (3, "Bob", 300);
-Query OK, 3 rows affected (0.28 sec)
-{'label':'label_c52b2c246e244dda_9b91ee5e27a31f9b', 'status':'VISIBLE',
'txnId':'2003'}
-
-mysql> select * from tbl3 order by id;
-+------+---------+-------+------+
-| id | name | score | aid |
-+------+---------+-------+------+
-| 1 | Doris | 100 | 0 |
-| 2 | Nereids | 200 | 1 |
-| 3 | Bob | 300 | 2 |
-+------+---------+-------+------+
-3 rows in set (0.13 sec)
-
-mysql> set enable_unique_key_partial_update=true;
-Query OK, 0 rows affected (0.00 sec)
-
-mysql> set enable_insert_strict=false;
-Query OK, 0 rows affected (0.00 sec)
-
-mysql> insert into tbl3(id, score) values(1, 999), (2, 888);
-Query OK, 2 rows affected (0.07 sec)
-{'label':'label_dfec927d7a4343ca_9f9ade581391de97', 'status':'VISIBLE',
'txnId':'2004'}
-
-mysql> select * from tbl3 order by id;
-+------+---------+-------+------+
-| id | name | score | aid |
-+------+---------+-------+------+
-| 1 | Doris | 999 | 0 |
-| 2 | Nereids | 888 | 1 |
-| 3 | Bob | 300 | 2 |
-+------+---------+-------+------+
-3 rows in set (0.06 sec)
-
-mysql> insert into tbl3(id, aid) values(1, 1000), (3, 500);
-Query OK, 2 rows affected (0.07 sec)
-{'label':'label_b26012959f714f60_abe23c87a06aa0bf', 'status':'VISIBLE',
'txnId':'2005'}
-
-mysql> select * from tbl3 order by id;
-+------+---------+-------+------+
-| id | name | score | aid |
-+------+---------+-------+------+
-| 1 | Doris | 999 | 1000 |
-| 2 | Nereids | 888 | 1 |
-| 3 | Bob | 300 | 500 |
-+------+---------+-------+------+
-3 rows in set (0.06 sec)
+```SQL
+CREATE TABLE `demo`.`tbl` (
+ `text` varchar(65533) NOT NULL,
+    `id` BIGINT NOT NULL AUTO_INCREMENT
+) ENGINE=OLAP
+UNIQUE KEY(`text`)
+DISTRIBUTED BY HASH(`text`) BUCKETS 10
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 3",
+"enable_unique_key_merge_on_write" = "true"
+);
```
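+
+Taking the Unique Key table created in the third example above as an
+illustration: if the auto-increment column `id` is not specified during an
+import, Doris fills it with generated values; if `id` is specified, NULL
+values in the imported data are replaced with generated values, while
+non-NULL values are written as-is (and are not covered by the uniqueness
+guarantee). A minimal sketch:
+
+```SQL
+-- id is not specified, so Doris generates a value for every row.
+insert into tbl(name, value) values("Bob", 10), ("Alice", 20), ("Jack", 30);
+
+-- id is specified; the NULL values are replaced with generated values.
+insert into tbl(id, name, value) values(null, "Doris", 60), (null, "Nereids", 70);
+
+select * from tbl order by id;
+```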
-## Usage Scenarios
+## Applicable scenarios
-### Dictionary Encoding
+### Dictionary encoding
-Using bitmaps for audience analysis in user profile requires building a user
dictionary where each user corresponds to a unique integer dictionary value.
Aggregating these dictionary values can improve the performance of bitmap.
+In user profiling, where bitmaps are used for user group analysis, analysts
often need to build a dictionary in which each user corresponds to a unique
integer value. Dense, sequential dictionary values can improve bitmap
performance.
-Taking the offline UV and PV analysis scenario as an example, assuming there's
a detailed user behavior table:
+Take offline UV (Unique Visitors) and PV (Page Views) analysis as an example.
Assume that all the detailed data is stored in the following user behavior
table:
-```sql
+```SQL
CREATE TABLE `demo`.`dwd_dup_tbl` (
`user_id` varchar(50) NOT NULL,
`dim1` varchar(50) NOT NULL,
@@ -366,10 +133,9 @@ PROPERTIES (
);
```
-Using the auto-incrementa column to create the following dictionary table:
-
+Create a dictionary table using AUTO_INCREMENT:
-```sql
+```SQL
CREATE TABLE `demo`.`dictionary_tbl` (
`user_id` varchar(50) NOT NULL,
`aid` BIGINT NOT NULL AUTO_INCREMENT
@@ -382,27 +148,26 @@ PROPERTIES (
);
```
-Import the value of `user_id` from existing data into the dictionary table,
establishing the mapping of `user_id` to integer values:
+Load the existing `user_id` into the dictionary table, and create mappings
from `user_id` to integer values:
-```sql
+```SQL
insert into dictionary_tbl(user_id)
select user_id from dwd_dup_tbl group by user_id;
```
-Or import only the value of `user_id` in incrementa data into the dictionary
table alternatively:
+Or load the incremental `user_id` into the dictionary table:
-
-```sql
+```SQL
insert into dictionary_tbl(user_id)
select dwd_dup_tbl.user_id from dwd_dup_tbl left join dictionary_tbl
-on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time >
'2023-12-10' and dictionary_tbl.user_id is NULL;
+on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time >
'2023-12-10' and dictionary_tbl.user_id is NULL;
```
-In real-world scenarios, Flink connectors can also be employed to write data
into Doris.
+You can also use the Flink-Doris-Connector to load data into Doris.
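+
+For example, the dictionary table can be declared as a Flink SQL sink roughly
+as follows. This is only a sketch: the FE address, credentials, and label
+prefix are placeholders, option names follow the Flink-Doris-Connector
+documentation, and depending on the connector version you may additionally
+need to pass the Stream Load `columns` header via `sink.properties.*`.
+
+```SQL
+-- Map the Doris dictionary table as a Flink SQL sink; an INSERT INTO
+-- dictionary_sink SELECT ... job then streams new user_id values into
+-- Doris, which fills the aid column with generated values.
+CREATE TABLE dictionary_sink (
+    `user_id` STRING
+) WITH (
+    'connector' = 'doris',
+    'fenodes' = 'fe_host:8030',
+    'table.identifier' = 'demo.dictionary_tbl',
+    'username' = 'user',
+    'password' = 'passwd',
+    'sink.label-prefix' = 'dict_sync'
+);
+```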
-Assuming `dim1`, `dim3`, `dim5` represent statistical dimensions of interest
to us, create the following table to store aggregated results:
+Suppose that `dim1`, `dim3`, and `dim5` are the analytic dimensions of
interest. Create an Aggregate Key table to store the aggregated results:
-```sql
+```SQL
CREATE TABLE `demo`.`dws_agg_tbl` (
`dim1` varchar(50) NOT NULL,
`dim3` varchar(50) NOT NULL,
@@ -417,25 +182,25 @@ PROPERTIES (
);
```
-Store the result of the data aggregation operations into the aggregation
result table:
+Load the aggregated results into the table:
-```sql
+```SQL
insert into dws_agg_tbl
select dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5,
BITMAP_UNION(TO_BITMAP(dictionary_tbl.aid)), COUNT(1)
from dwd_dup_tbl INNER JOIN dictionary_tbl on dwd_dup_tbl.user_id =
dictionary_tbl.user_id
group by dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5;
```
-Perform UV and PV queries using the following statement:
+Then query PV/UV using the following statement:
-```sql
+```SQL
select dim1, dim3, dim5, user_id_bitmap as uv, pv from dws_agg_tbl;
```
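+
+If you need the UV as a number rather than as a bitmap object, you can, for
+example, aggregate with `BITMAP_UNION_COUNT`:
+
+```SQL
+select dim1, dim3, dim5,
+       BITMAP_UNION_COUNT(user_id_bitmap) as uv, sum(pv) as pv
+from dws_agg_tbl
+group by dim1, dim3, dim5;
+```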
-### Efficient Pagination
+### Efficient pagination
-When displaying data on a page, pagination is often necessary. Traditional
pagination typically involves using `limit`, `offset`, and `order by` in SQL
queries. For instance, consider the following business table intended for
display:
+Pagination is often required when displaying data. Typically, pagination is
implemented with SQL clauses such as `limit`, `offset`, and `order by`.
Suppose that you need to display the following business table:
-```sql
+```SQL
CREATE TABLE `demo`.`records_tbl` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
@@ -452,23 +217,23 @@ PROPERTIES (
);
```
-Assuming 100 records are displayed per page in pagination. To fetch the first
page's data, the following SQL query can be used:
+In pagination queries, if each page displays 100 results, you can retrieve the
first page using the following SQL statement:
-```sql
-select * from records_tbl order by `key`, `name` limit 100;
+```SQL
+select * from records_tbl order by `key`, `name` limit 100;
```
-Fetching the data for the second page can be accomplished by:
+You can retrieve the second page using the following SQL statement:
-```sql
-select * from records_tbl order by `key`, `name` limit 100, offset 100;
+```SQL
+select * from records_tbl order by `key`, `name` limit 100 offset 100;
```
-However, when performing deep pagination queries (with large offsets), even if
the actual required data rows are few, this method still reads all data into
memory for full sorting before subsequent processing, which is quite
inefficient. Using an auto-incrementa column assigns a unique value to each
row, allowing the use of where `unique_value` > x limit y to filter a
significant amount of data beforehand, making pagination more efficient.
+However, when performing deep pagination queries (with a large offset), even
if only a few rows of data are actually requested, the previous method still
reads all the data into memory, performs a full sorting, and then proceeds with
further processing. For higher execution efficiency, you can assign a unique
value to each row of data using an auto-increment column. This method records
the maximum value of the `unique_value` column from the previous page and uses
the `where unique_value [...]
-Continuing with the aforementioned business table, an auto-increment column is
added to the table to give each row a unique identifier:
+Based on the table above, you can add an auto-increment column to assign a
unique identifier to each row:
-```sql
+```SQL
CREATE TABLE `demo`.`records_tbl2` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
@@ -486,23 +251,25 @@ PROPERTIES (
);
```
-For pagination displaying 100 records per page, to fetch the first page's
data, the following SQL query can be used:
+In pagination queries, suppose that each page displays 100 results. You can
retrieve the first page by:
-```sql
+```SQL
select * from records_tbl2 order by unique_value limit 100;
```
-By recording the maximum value of unique_value in the returned results, let's
assume it's 99. The following query can then fetch data for the second page:
+Record the maximum `unique_value` of the returned result in your application.
Suppose that the maximum is 99; you can then query data from the second page
by:
-```sql
-select * from records_tbl2 where unique_value > 99 order by unique_value limit
100;
+```SQL
+select * from records_tbl2 where unique_value > 99 order by unique_value
limit 100;
```
-If directly querying contents from a later page and it's inconvenient to
directly obtain the maximum value of `unique_value` from the preceding page's
data (for instance, directly obtaining contents from the 101st page), the
following query can be used:
+If you need to jump directly to a deep page, for example, page 101, so that
the maximum `unique_value` of the previous page cannot be obtained easily, you
can use the following statement:
-```sql
+```SQL
select `key`, name, address, city, nation, region, phone, mktsegment
-from records_tbl2, (select unique_value as max_value from records_tbl2 order
by unique_value limit 1 offset 9999) as previous_data
-where records_tbl2.unique_value > previous_data.max_value
-order by records_tbl2.unique_value limit 100;
+from records_tbl2, (select unique_value as max_value from records_tbl2 order
by unique_value limit 1 offset 9999) as previous_data
+where records_tbl2.unique_value > previous_data.max_value
+order by records_tbl2.unique_value limit 100;
```
+
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]