This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 023a96b0cf7 [Opt](load) add error data handling (#917)
023a96b0cf7 is described below

commit 023a96b0cf7b962b94e611be1925cd0d61f634b7
Author: Xin Liao <[email protected]>
AuthorDate: Tue Jul 30 22:10:50 2024 +0800

    [Opt](load) add error data handling (#917)
---
 docs/data-operate/import/error-data-handling.md    | 282 +++++++++++++++++++++
 docs/data-operate/import/load-data-convert.md      |  22 --
 docs/data-operate/import/load-strict-mode.md       | 179 -------------
 ...{load-strict-mode.md => error-data-handling.md} | 242 ++++++++++++------
 .../data-operate/import/load-data-convert.md       |  22 --
 sidebars.json                                      |   6 +-
 6 files changed, 454 insertions(+), 299 deletions(-)

diff --git a/docs/data-operate/import/error-data-handling.md b/docs/data-operate/import/error-data-handling.md
new file mode 100644
index 00000000000..0d534e1f737
--- /dev/null
+++ b/docs/data-operate/import/error-data-handling.md
@@ -0,0 +1,282 @@
+---
+{
+    "title": "Error Data Handling",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+During import, the data types of the source columns may not match the target columns exactly, so the load process converts values whose types differ. This conversion can fail for reasons such as field type mismatch, field overflow, or precision mismatch.
+
+Strict mode (strict_mode) is used to control whether error rows that fail this conversion are filtered out during import.
+
+The maximum error rate (max_filter_ratio) is used to control the maximum tolerated proportion of filtered error rows.
+
+## Strict Mode
+
+Strict mode serves two purposes: filtering out error rows whose column type conversion fails during import, and restricting partial column updates so they only update rows whose keys already exist.
+
+### Filtering of Failed Column Type Conversions
+
+The filtering strategy of strict mode is as follows:
+
+- When strict mode is disabled, fields that fail conversion are converted to NULL values, and these error rows containing NULL values are imported together with the correct data rows.
+
+- When strict mode is enabled, error rows that fail conversion are filtered out, and only the correct data rows are imported. Here, error rows are rows whose original value is not NULL but whose value becomes NULL after column type conversion; "column type conversion" does not include NULL values produced by function computation.
+
+- Both correct data rows and error data rows may contain NULL values. If the target column does not allow NULL values, rows containing NULL in that column are also filtered out.
+
+For imported columns with range restrictions, strict mode has no effect on values that pass type conversion but fail the range check. For example, with type `decimal(1,0)` and original value 10, the value converts successfully but falls outside the range declared by the column; such data is filtered in other parts of the load pipeline, not by strict mode.
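+
+A quick way to see which values strict mode treats as conversion failures is to try the cast directly; a failed conversion yields NULL in a non-strict context (a minimal sketch, with the expected results inferred from the conversion rules above):
+
+```sql
+-- 'abc' cannot be parsed as a TinyInt, so the cast yields NULL;
+-- under strict mode, a source row carrying this value would be filtered out.
+SELECT CAST('abc' AS TINYINT);   -- expected: NULL
+
+-- 2000 overflows TinyInt, so it also converts to NULL (see the table below).
+SELECT CAST('2000' AS TINYINT);  -- expected: NULL
+```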
+
+**1. Example with column type TinyInt:**
+
+| Original Data Type | Original Data Example | Value after Conversion to TinyInt | Strict Mode         | Result                       |
+| ------------------ | --------------------- | --------------------------------- | ------------------- | ---------------------------- |
+| NULL               | \N                    | NULL                              | Enabled or Disabled | NULL                         |
+| Non-NULL           | "abc" or 2000         | NULL                              | Enabled             | Invalid value (filtered out) |
+| Non-NULL           | "abc"                 | NULL                              | Disabled            | NULL                         |
+| Non-NULL           | 1                     | 1                                 | Enabled or Disabled | Correct import               |
+
+:::tip
+1. The columns in the table allow NULL values to be imported.
+
+2. `abc` and `2000` become NULL after conversion to TinyInt due to type or precision issues. With strict mode enabled, these rows are filtered out; with it disabled, `null` is imported.
+:::
+
+**2. Example with column type Decimal(1,0):**
+
+| Original Data Type | Original Data Example | Value after Conversion to Decimal | Strict Mode         | Result                       |
+| ------------------ | --------------------- | --------------------------------- | ------------------- | ---------------------------- |
+| NULL               | \N                    | NULL                              | Enabled or Disabled | NULL                         |
+| Non-NULL           | aaa                   | NULL                              | Enabled             | Invalid value (filtered out) |
+| Non-NULL           | aaa                   | NULL                              | Disabled            | NULL                         |
+| Non-NULL           | 1 or 10               | 1 or 10                           | Enabled or Disabled | Correct import               |
+
+:::tip
+1. The columns in the table allow NULL values to be imported.
+
+2. `aaa` becomes NULL after conversion to Decimal due to type issues. With strict mode enabled, these rows are filtered out; with it disabled, `null` is imported.
+
+3. Although `10` exceeds the declared range, it is not affected by strict mode because its type satisfies the decimal requirements. `10` is filtered out in other import processing flows, but not by strict mode.
+:::
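+
+The filtering behavior above can be reproduced end to end with `INSERT INTO` and the `enable_insert_strict` session variable described later in this document (a sketch against a throwaway table; the table name, schema, and single-replica property are assumptions):
+
+```sql
+CREATE TABLE example_db.strict_demo
+(
+    id  INT,
+    val TINYINT
+)
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_num" = "1");
+
+-- Strict mode off: the failed conversion becomes NULL and the row is kept.
+SET enable_insert_strict = false;
+INSERT INTO example_db.strict_demo VALUES (1, "abc");  -- stored as (1, NULL)
+
+-- Strict mode on: the same value is treated as an error row and rejected.
+SET enable_insert_strict = true;
+INSERT INTO example_db.strict_demo VALUES (2, "abc");  -- statement fails
+```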
+
+
+### Limiting Partial Column Updates to Existing Keys
+
+In strict mode, each row inserted through a partial column update must have a key that already exists in the table. In non-strict mode, a partial column update can update rows with existing keys or insert new rows with keys that do not yet exist.
+
+For example, consider the following table structure:
+```
+mysql> desc user_profile;
++------------------+-----------------+------+-------+---------+-------+
+| Field            | Type            | Null | Key   | Default | Extra |
++------------------+-----------------+------+-------+---------+-------+
+| id               | INT             | Yes  | true  | NULL    |       |
+| name             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
+| age              | INT             | Yes  | false | NULL    | NONE  |
+| city             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
+| balance          | DECIMALV3(9, 0) | Yes  | false | NULL    | NONE  |
+| last_access_time | DATETIME        | Yes  | false | NULL    | NONE  |
++------------------+-----------------+------+-------+---------+-------+
+```
+
+There is a data record in the table as follows:
+
+```text
+1,"kevin",18,"shenzhen",400,"2023-07-01 12:00:00"
+```
+
+When a user performs a partial column update with Stream Load in non-strict mode, inserting the following data into the table:
+
+```text
+1,500,2023-07-03 12:00:01
+3,23,2023-07-03 12:00:02
+18,9999999,2023-07-03 12:00:03
+```
+
+```bash
+curl --location-trusted -u root -H "partial_columns:true" -H "strict_mode:false" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://host:port/api/db1/user_profile/_stream_load
+```
+
+One existing row in the table is updated, and two new rows are inserted. For columns the user does not specify in the inserted data, the column's default value is used if it has one; otherwise, if the column allows NULL values, NULL is filled in; if neither condition holds, the insertion fails.
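+
+After this non-strict partial update, the table contents would look roughly as follows (a sketch; the unspecified columns fall back to NULL because every column here defaults to NULL):
+
+```sql
+SELECT * FROM user_profile ORDER BY id;
+-- 1,  "kevin", 18,   "shenzhen", 500,     "2023-07-03 12:00:01"  (updated)
+-- 3,  NULL,    NULL, NULL,       23,      "2023-07-03 12:00:02"  (inserted)
+-- 18, NULL,    NULL, NULL,       9999999, "2023-07-03 12:00:03"  (inserted)
+```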
+
+When a user performs the same partial column update with Stream Load in strict mode, the import fails, because strict mode is enabled and the keys of the second and third rows (`(3)`, `(18)`) do not exist in the original table.
+
+```bash
+curl --location-trusted -u root -H "partial_columns:true" -H "strict_mode:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://host:port/api/db1/user_profile/_stream_load
+```
+
+### Configuration Method
+
+Strict mode is disabled by default (`strict_mode` is `false`). The way to set it varies by import method.
+
+[STREAM LOAD](./import-way/stream-load-manual.md)
+
+   ```bash
+   curl --location-trusted -u user:passwd \
+   -H "strict_mode: true" \
+   -T 1.txt \
+   http://host:port/api/example_db/my_table/_stream_load
+   ```
+
+[BROKER LOAD](./import-way/broker-load-manual.md)
+
+   ```sql
+   LOAD LABEL example_db.example_label_1
+   (
+       DATA INFILE("s3://your_bucket_name/your_file.txt")
+       INTO TABLE load_test
+       COLUMNS TERMINATED BY ","
+   )
+   WITH S3
+   (
+       "AWS_ENDPOINT" = "AWS_ENDPOINT",
+       "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+       "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+       "AWS_REGION" = "AWS_REGION"
+   )
+   PROPERTIES
+   (
+        "strict_mode" = "true"
+   );
+   ```
+
+[ROUTINE LOAD](./import-way/routine-load-manual.md)
+
+   ```sql
+   CREATE ROUTINE LOAD example_db.test_job ON my_table
+   PROPERTIES
+   (
+       "strict_mode" = "true"
+   ) 
+   FROM KAFKA
+   (
+       "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+       "kafka_topic" = "my_topic"
+   );
+   ```
+
+[MySQL Load](./import-way/mysql-load-manual.md)
+
+   ```sql
+   LOAD DATA LOCAL
+   INFILE 'testData'
+   INTO TABLE testDb.testTbl
+   PROPERTIES
+   (
+       "strict_mode" = "true"
+   );
+   ```
+
+[INSERT INTO](./import-way/insert-into-manual.md)
+
+   ```sql
+   SET enable_insert_strict = true;
+   INSERT INTO my_table ...;
+   ```
+
+## Maximum Error Rate
+
+Import tasks allow users to set a maximum error rate (`max_filter_ratio`). If the error rate of the imported data is below this threshold, the error rows are ignored and the remaining correct data is imported; otherwise, the import fails.
+
+### Error Rate Calculation Method
+
+The rows processed by an import job fall into the following three categories:
+
+- Filtered Rows: rows filtered out due to data quality issues, such as type errors, precision errors, strings exceeding length limits, mismatched file column counts, or missing corresponding partitions.
+
+- Unselected Rows: rows filtered out by [pre-filtering](./load-data-convert.md) or [post-filtering](./load-data-convert.md) conditions.
+
+- Loaded Rows: Data rows that are successfully imported.
+
+The error rate is calculated as:
+
+```Plain
+#Filtered Rows / (#Filtered Rows + #Loaded Rows)
+```
+
+In other words, `Unselected Rows` are not included in the error rate calculation.
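+
+As a worked example (numbers assumed purely for illustration), suppose a job reads 10,000 source rows: 200 are filtered for quality issues, 300 are excluded by a `WHERE` condition, and 9,500 are loaded:
+
+```Plain
+#Filtered Rows = 200, #Unselected Rows = 300, #Loaded Rows = 9,500
+error rate = 200 / (200 + 9,500) ≈ 0.021
+```
+
+The 300 unselected rows do not enter the calculation, so the job succeeds with `max_filter_ratio` of 0.1 but fails with the default of 0.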
+
+### Configuration Method
+
+The default value of `max_filter_ratio` is 0, meaning the entire import task fails if there is any error row.
+
+[Stream Load](./import-way/stream-load-manual.md)
+
+   ```bash
+   curl --location-trusted -u user:passwd \
+   -H "max_filter_ratio: 0.1" \
+   -T 1.txt \
+   http://host:port/api/example_db/my_table/_stream_load
+   ```
+
+[Broker Load](./import-way/broker-load-manual.md)
+
+   ```sql
+   LOAD LABEL example_db.example_label_1
+   (
+        DATA INFILE("s3://your_bucket_name/your_file.txt")
+        INTO TABLE load_test
+        COLUMNS TERMINATED BY ","
+   )
+   WITH S3
+   (
+        "AWS_ENDPOINT" = "AWS_ENDPOINT",
+        "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+        "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+        "AWS_REGION" = "AWS_REGION"
+   )
+   PROPERTIES
+   (
+        "max_filter_ratio" = "0.1"
+   );
+   ```
+
+[Routine Load](./import-way/routine-load-manual.md)
+
+   ```sql
+   CREATE ROUTINE LOAD example_db.test_job ON my_table
+   PROPERTIES
+   (
+        "max_filter_ratio" = "0.1"
+   ) 
+   FROM KAFKA
+   (
+        "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+        "kafka_topic" = "my_topic"
+   );
+   ```
+
+[MySQL Load](./import-way/mysql-load-manual.md)
+
+   ```sql
+   LOAD DATA LOCAL
+   INFILE 'testData'
+   INTO TABLE testDb.testTbl
+   PROPERTIES
+   (
+       "max_filter_ratio" = "0.1"
+   );
+   ```
+
+[INSERT INTO](./import-way/insert-into-manual.md)
+
+   ```sql
+   SET insert_max_filter_ratio = 0.1;
+   INSERT INTO my_table FROM S3/HDFS/LOCAL();
+   ```
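+
+Because `insert_max_filter_ratio` only applies when strict insert is off (see the tip below), a typical session combines the two variables (a sketch; `my_table` and the source clause are placeholders from the examples above):
+
+```sql
+SET enable_insert_strict = false;    -- required for the ratio to take effect
+SET insert_max_filter_ratio = 0.1;   -- tolerate up to 10% error rows
+INSERT INTO my_table FROM S3/HDFS/LOCAL();
+```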
+:::tip
+This variable takes effect only when `enable_insert_strict` is `false`, and controls the maximum error rate for `INSERT INTO ... FROM S3/HDFS/LOCAL()`. Its default value is 1.0, meaning all error rows are tolerated.
+:::
\ No newline at end of file
diff --git a/docs/data-operate/import/load-data-convert.md b/docs/data-operate/import/load-data-convert.md
index 1d4e98ccd6e..34f401001fa 100644
--- a/docs/data-operate/import/load-data-convert.md
+++ b/docs/data-operate/import/load-data-convert.md
@@ -403,25 +403,3 @@ where k1 is not null and k4 >= 1.2
 | ---- | ---- | ---- | ---- |
 | 2    | 200  | 2    | 1.2  |
 | 3    | 300  | 3    | 1.3  |
-
-## Best Practices
-
-### Data Quality Issues and Filtering Threshold
-
-The rows of data processed in the load job can be classified into the following three categories:
-
-- Filtered Rows: Data rows that are filtered out due to data quality issues. Data quality issues can include type errors, precision errors, strings exceeding length limits, mismatched file column counts, and data rows filtered out due to missing corresponding partitions.
-
-- Unselected Rows: These are data rows filtered out due to `preceding filter` or `where` column filtering conditions.
-
-- Loaded Rows: Data rows that are successfully loaded.
-
-Doris's load task allows users to set a maximum error rate (`max_filter_ratio`). If the error rate of the loaded data is below the threshold, the error rows will be ignored, and the other correct data will be loaded.
-
-The error rate is calculated as follows:
-
-```Plain
-#Filtered Rows / (#Filtered Rows + #Loaded Rows)
-```
-
-This means that `Unselected Rows` are not included in the error rate calculation.
\ No newline at end of file
diff --git a/docs/data-operate/import/load-strict-mode.md b/docs/data-operate/import/load-strict-mode.md
deleted file mode 100644
index fc90f48ba52..00000000000
--- a/docs/data-operate/import/load-strict-mode.md
+++ /dev/null
@@ -1,179 +0,0 @@
----
-{
-    "title": "Loading Strict Mode",
-    "language": "en"
-}
----
-
-<!-- 
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-
-Strict mode (strict_mode) is configured as a parameter in the import operation. This parameter affects the import behavior of certain values and the final imported data.
-
-This document mainly explains how to set strict mode, and the impact of strict mode.
-
-## How to set
-
-Strict mode is all False by default, i.e. off.
-
-Different import methods set strict mode in different ways.
-
-1. [BROKER LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/BROKER-LOAD)
-
-   ```sql
-   LOAD LABEL example_db.label1
-   (
-       DATA INFILE("bos://my_bucket/input/file.txt")
-       INTO TABLE `my_table`
-       COLUMNS TERMINATED BY ","
-   )
-   WITH BROKER bos
-   (
-       "bos_endpoint" = "http://bj.bcebos.com";,
-       "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxxx",
-       "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyy"
-   )
-   PROPERTIES
-   (
-       "strict_mode" = "true"
-   )
-   ```
-
-2. [STREAM LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/STREAM-LOAD)
-
-   ```bash
-   curl --location-trusted -u user:passwd \
-   -H "strict_mode: true" \
-   -T 1.txt \
-   http://host:port/api/example_db/my_table/_stream_load
-   ```
-
-3. [ROUTINE LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD)
-
-   ```sql
-   CREATE ROUTINE LOAD example_db.test_job ON my_table
-   PROPERTIES
-   (
-       "strict_mode" = "true"
-   )
-   FROM KAFKA
-   (
-       "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
-       "kafka_topic" = "my_topic"
-   );
-   ```
-
-4. [INSERT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/INSERT)
-
-   Set via [session variables](../../query/query-variables/variables.md):
-
-   ```sql
-   SET enable_insert_strict = true;
-   INSERT INTO my_table ...;
-   ```
-
-## The role of strict mode
-
-- Restricting the filtering of column type conversions during import.
-
-The strict filtering strategy is as follows:
-
-For column type conversion, if strict mode is turned on, the wrong data will be filtered. The wrong data here refers to: the original data is not `null`, but the result is `null` after column type conversion.
-
-The `column type conversion` mentioned here does not include the `null` value calculated by the function.
-
-For an imported column type that contains range restrictions, if the original data can pass the type conversion normally, but cannot pass the range restrictions, strict mode will not affect it. For example: if the type is `decimal(1,0)` and the original data is 10, it can be converted by type but is not within the range declared by the column. Strict mode has no effect on this kind of data.
-
-1. Take the column type as TinyInt as an example:
-
-   | Primitive data type | Primitive data example | Converted value to TinyInt | Strict mode | Result                   |
-   | ------------------- | ---------------------- | -------------------------- | ----------- | ------------------------ |
-   | NULL                | \N                     | NULL                       | On or Off   | NULL                     |
-   | Non-null value      | "abc" or 2000          | NULL                       | On          | Illegal value (filtered) |
-   | Non-null value      | "abc"                  | NULL                       | Off         | NULL                     |
-   | Non-null value      | 1                      | 1                          | On or Off   | Imported correctly       |
-
-   > Description:
-   >
-   > 1. Columns in the table allow null values to be imported
-   > 2. After `abc` and `2000` are converted to TinyInt, they become NULL due to type or precision issues. When strict mode is on, this data is filtered; when it is off, `null` is imported.
-
-2. Take the column type as Decimal(1,0) as an example
-
-   | Primitive Data Types | Examples of Primitive Data | Converted to Decimal | Strict Mode | Result                   |
-   | -------------------- | -------------------------- | -------------------- | ----------- | ------------------------ |
-   | NULL                 | \N                         | NULL                 | On or Off   | NULL                     |
-   | Non-null value       | aaa                        | NULL                 | On          | Illegal value (filtered) |
-   | Non-null value       | aaa                        | NULL                 | Off         | NULL                     |
-   | Non-null value       | 1 or 10                    | 1 or 10              | On or Off   | Imported correctly       |
-
-   > Description:
-   >
-   > 1. Columns in the table allow null values to be imported
-   > 2. After `aaa` is converted to Decimal, it becomes NULL due to a type problem. When strict mode is on, this data is filtered; when it is off, `null` is imported.
-   > 3. Although `10` is an out-of-range value, strict mode does not affect it because its type conforms to the requirements of decimal. `10` is eventually filtered in other import processing flows, but not by strict mode.
-
-- Restricting partial column updates to only existing keys
-
-In strict mode, when performing partial column updates, each row of data inserted must have a key that already exists in the table. In non-strict mode, partial column updates can update existing rows with existing keys or insert new rows with non-existing keys.
-
-For example, consider the following table structure:
-
-```
-mysql> desc user_profile;
-+------------------+-----------------+------+-------+---------+-------+
-| Field            | Type            | Null | Key   | Default | Extra |
-+------------------+-----------------+------+-------+---------+-------+
-| id               | INT             | Yes  | true  | NULL    |       |
-| name             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
-| age              | INT             | Yes  | false | NULL    | NONE  |
-| city             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
-| balance          | DECIMALV3(9, 0) | Yes  | false | NULL    | NONE  |
-| last_access_time | DATETIME        | Yes  | false | NULL    | NONE  |
-+------------------+-----------------+------+-------+---------+-------+
-```
-
-The table contains the following data:
-
-```
-1,"kevin",18,"shenzhen",400,"2023-07-01 12:00:00"
-```
-
-When a user uses non-strict mode stream load for partial column updates and inserts the following data into the table:
-
-```
-1,500,2023-07-03 12:00:01
-3,23,2023-07-03 12:00:02
-18,9999999,2023-07-03 12:00:03
-```
-
-```
-curl --location-trusted -u root -H "partial_columns:true" -H "strict_mode:false" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://host:port/api/db1/user_profile/_stream_load
-```
-
-The existing row in the table will be updated, and two new rows will be inserted. For columns in the inserted data that are not specified by the user, if the column has a default value, it will be filled with the default value. Otherwise, if the column allows NULL, it will be filled with a NULL value. If neither condition is met, the insertion will not succeed.
-
-However, when a user uses strict mode stream load for partial column updates and inserts the above data into the table:
-
-```
-curl --location-trusted -u root -H "partial_columns:true" -H "strict_mode:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://host:port/api/db1/user_profile/_stream_load
-```
-
-In this case, since strict mode is enabled and the keys `(3)` and `(18)` in the second and third rows, respectively, do not exist in the original table, the import will fail.
\ No newline at end of file
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-strict-mode.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/error-data-handling.md
similarity index 52%
rename from i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-strict-mode.md
rename to i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/error-data-handling.md
index fc8d2c5b6df..e668294da6c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-strict-mode.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/error-data-handling.md
@@ -1,6 +1,6 @@
 ---
 {
-    "title": "严格模式",
+    "title": "错误数据处理",
     "language": "zh-CN"
 }
 ---
@@ -24,82 +24,27 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+During import, the data types of the source columns may not match the target columns exactly, so the load process converts values whose types differ. This conversion can fail for reasons such as field type mismatch, field overflow, or precision mismatch.
 
-Strict mode (strict_mode) is a parameter configured in the import operation. It affects the import behavior of certain values and the final imported data.
+Strict mode (strict_mode) is used to control whether error rows that fail this conversion are filtered out during import.
 
-This document mainly explains how to set strict mode and its impact.
+The maximum error rate (max_filter_ratio) is used to control the maximum tolerated proportion of filtered error rows.
 
-## How to Set
+## Strict Mode
 
-Strict mode is False by default, i.e. disabled.
+Strict mode serves two purposes: filtering out error rows whose column type conversion fails during import, and restricting partial column updates so they only update rows whose keys already exist.
 
-Different import methods set strict mode in different ways.
+### Filtering of Failed Column Type Conversions
 
-1. [BROKER LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/BROKER-LOAD)
+The filtering strategy of strict mode is as follows:
 
-   ```sql
-   LOAD LABEL example_db.label1
-   (
-       DATA INFILE("bos://my_bucket/input/file.txt")
-       INTO TABLE `my_table`
-       COLUMNS TERMINATED BY ","
-   )
-   WITH BROKER bos
-   (
-       "bos_endpoint" = "http://bj.bcebos.com";,
-       "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
-       "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
-   )
-   PROPERTIES
-   (
-       "strict_mode" = "true"
-   )
-   ```
-
-2. [STREAM LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/STREAM-LOAD)
-
-   ```bash
-   curl --location-trusted -u user:passwd \
-   -H "strict_mode: true" \
-   -T 1.txt \
-   http://host:port/api/example_db/my_table/_stream_load
-   ```
-
-3. [ROUTINE LOAD](../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD)
-
-   ```sql
-   CREATE ROUTINE LOAD example_db.test_job ON my_table
-   PROPERTIES
-   (
-       "strict_mode" = "true"
-   ) 
-   FROM KAFKA
-   (
-       "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
-       "kafka_topic" = "my_topic"
-   );
-   ```
-
-4. [INSERT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/INSERT)
-
-   Set via [session variables](../../query/query-variables/variables):
+- When strict mode is disabled, fields that fail conversion are converted to NULL values, and these error rows containing NULL values are imported together with the correct data rows.
 
-   ```sql
-   SET enable_insert_strict = true;
-   INSERT INTO my_table ...;
-   ```
-
-## The Role of Strict Mode
+- When strict mode is enabled, error rows that fail conversion are filtered out, and only the correct data rows are imported. Error rows here are rows whose original value is not `NULL` but whose value becomes `NULL` after column type conversion; `column type conversion` does not include `NULL` values produced by function computation.
 
-### Strict filtering of column type conversions during import
+- Both correct data rows and error data rows may contain `NULL` values. If the target column does not allow `NULL` values, rows containing `NULL` are also filtered out.
 
-The strict filtering strategy is as follows:
-
-For column type conversion, if strict mode is turned on, the wrong data is filtered. The wrong data here refers to rows whose original value is not `null` but whose result is `null` after column type conversion.
-
-The `column type conversion` mentioned here does not include `null` values calculated by functions.
-
-For an imported column with range restrictions, if the original data passes type conversion but cannot pass the range restriction, strict mode does not affect it. For example: if the type is `decimal(1,0)` and the original data is 10, it can be converted by type but is not within the declared range of the column. Strict mode has no effect on this kind of data.
+For an imported column with range restrictions, if the original data passes type conversion but cannot pass the range restriction, strict mode does not affect it. For example: if the type is `decimal(1,0)` and the original data is 10, it can be converted by type but is not within the declared range of the column. Strict mode has no effect on this kind of data.
 
 **1. Example with column type TinyInt:**
 
@@ -111,8 +56,6 @@ under the License.
 | Non-null value | 1 | 1 | Enabled or Disabled | Correct import |
 
 :::tip
-Note:
-
 1. The columns in the table allow null values to be imported
 
 2. `abc` and `2000` become NULL after conversion to TinyInt due to type or precision issues. With strict mode enabled, such rows are filtered out; with it disabled, `null` is imported.
@@ -128,8 +71,6 @@ under the License.
 | Non-null value | 1 or 10 | 1 or 10 | Enabled or Disabled | Correct import |
 
 :::tip
-Note:
-
 1. The columns in the table allow null values to be imported
 
 2. `aaa` becomes NULL after conversion to Decimal due to type issues. With strict mode enabled, such rows are filtered out; with it disabled, `null` is imported.
@@ -177,10 +118,165 @@ curl  --location-trusted -u root -H "partial_columns:true" -H "strict_mode:false
 
 One existing row in the table is updated, and two new rows are inserted. For columns not specified by the user in the inserted data, the default value is used if the column has one; otherwise, if the column allows NULL, NULL is filled in; otherwise the insertion fails.
 
-However, when a user uses strict mode Stream Load partial column updates to insert the above data into the table
+When a user uses strict mode Stream Load partial column updates to insert the above data, the import fails because strict mode is enabled and the keys of the second and third rows (`(3)`, `(18)`) do not exist in the original table.
 
 ```bash
 curl --location-trusted -u root -H "partial_columns:true" -H "strict_mode:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://host:port/api/db1/user_profile/_stream_load
 ```
 
-At this point, since strict mode is enabled and the keys (`(3)`, `(18)`) of the second and third rows are not in the original table, this import fails.
+### Configuration Method
+
+Strict mode is disabled by default (`strict_mode` is `false`). The way to set it varies by import method.
+
+[STREAM LOAD](./import-way/stream-load-manual.md)
+
+   ```bash
+   curl --location-trusted -u user:passwd \
+   -H "strict_mode: true" \
+   -T 1.txt \
+   http://host:port/api/example_db/my_table/_stream_load
+   ```
+
+[BROKER LOAD](./import-way/broker-load-manual.md)
+
+   ```sql
+   LOAD LABEL example_db.example_label_1
+   (
+       DATA INFILE("s3://your_bucket_name/your_file.txt")
+       INTO TABLE load_test
+       COLUMNS TERMINATED BY ","
+   )
+   WITH S3
+   (
+       "AWS_ENDPOINT" = "AWS_ENDPOINT",
+       "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+       "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+       "AWS_REGION" = "AWS_REGION"
+   )
+   PROPERTIES
+   (
+        "strict_mode" = "true"
+   );
+   ```
+
+[ROUTINE LOAD](./import-way/routine-load-manual.md)
+
+   ```sql
+   CREATE ROUTINE LOAD example_db.test_job ON my_table
+   PROPERTIES
+   (
+       "strict_mode" = "true"
+   ) 
+   FROM KAFKA
+   (
+       "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+       "kafka_topic" = "my_topic"
+   );
+   ```
+
+[MySQL Load](./import-way/mysql-load-manual.md)
+
+   ```sql
+   LOAD DATA LOCAL
+   INFILE 'testData'
+   INTO TABLE testDb.testTbl
+   PROPERTIES
+   (
+       "strict_mode" = "true"
+   );
+   ```
+
+[INSERT INTO](./import-way/insert-into-manual.md)
+
+   ```sql
+   SET enable_insert_strict = true;
+   INSERT INTO my_table ...;
+   ```
+
+## Maximum Error Rate
+
+Import tasks allow users to set a maximum error rate (`max_filter_ratio`). If the error rate of the imported data is below this threshold, the error rows are ignored and the remaining correct data is imported; otherwise, the import fails.
+
+### Error Rate Calculation Method
+
+The rows processed by an import job fall into the following three categories:
+
+- Filtered Rows: rows filtered out due to data quality issues, such as type errors, precision errors, strings exceeding length limits, mismatched file column counts, or missing corresponding partitions.
+
+- Unselected Rows: rows filtered out by [pre-filtering](./load-data-convert.md) or [post-filtering](./load-data-convert.md) conditions.
+
+- Loaded Rows: rows that are successfully imported.
+
+The error rate is calculated as:
+
+```Plain
+#Filtered Rows / (#Filtered Rows + #Loaded Rows)
+```
+
+In other words, `Unselected Rows` are not included in the error rate calculation.
+
+### Configuration Method
+
+The default value of `max_filter_ratio` is 0, meaning the entire import task fails if there is any error row.
+
+[Stream Load](./import-way/stream-load-manual.md)
+
+   ```bash
+   curl --location-trusted -u user:passwd \
+   -H "max_filter_ratio: 0.1" \
+   -T 1.txt \
+   http://host:port/api/example_db/my_table/_stream_load
+   ```
+
+[Broker Load](./import-way/broker-load-manual.md)
+
+   ```sql
+   LOAD LABEL example_db.example_label_1
+   (
+        DATA INFILE("s3://your_bucket_name/your_file.txt")
+        INTO TABLE load_test
+        COLUMNS TERMINATED BY ","
+   )
+   WITH S3
+   (
+        "AWS_ENDPOINT" = "AWS_ENDPOINT",
+        "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+        "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+        "AWS_REGION" = "AWS_REGION"
+   )
+   PROPERTIES
+   (
+        "max_filter_ratio" = "0.1"
+   );
+   ```
+
+[Routine Load](./import-way/routine-load-manual.md)
+
+   ```sql
+   CREATE ROUTINE LOAD example_db.test_job ON my_table
+   PROPERTIES
+   (
+        "max_filter_ratio" = "0.1"
+   ) 
+   FROM KAFKA
+   (
+        "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+        "kafka_topic" = "my_topic"
+   );
+   ```
+
+[MySQL Load](./import-way/mysql-load-manual.md)
+
+   ```sql
+   LOAD DATA LOCAL
+   INFILE 'testData'
+   INTO TABLE testDb.testTbl
+   PROPERTIES
+   (
+       "max_filter_ratio" = "0.1"
+   );
+   ```
+
+[INSERT INTO](./import-way/insert-into-manual.md)
+
+   ```sql
+   SET insert_max_filter_ratio = 0.1;
+   INSERT INTO my_table FROM S3/HDFS/LOCAL();
+   ```
+:::tip
+This variable takes effect only when `enable_insert_strict` is `false`, and controls the maximum error rate for `INSERT INTO ... FROM S3/HDFS/LOCAL()`. Its default value is 1.0, meaning all error rows are tolerated.
+:::
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
index ad791674b98..62a3b1c8b0e 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-data-convert.md
@@ -404,25 +404,3 @@ where k1 is not null and k4 >= 1.2
 | ---- | ---- | ---- | ---- |
 | 2    | 200  | 2    | 1.2  |
 | 3    | 300  | 3    | 1.3  |
-
-## Best Practices
-
-### Data Quality Issues and Filtering Threshold
-
-The rows processed by an import job fall into the following three categories:
-
-- Filtered Rows: rows filtered out due to data quality issues, such as type errors, precision errors, strings exceeding length limits, mismatched file column counts, or missing corresponding partitions.
-
-- Unselected Rows: rows filtered out by `preceding filter` or `where` column filtering conditions.
-
-- Loaded Rows: rows that are successfully imported.
-
-Doris import tasks allow users to set a maximum error rate (`max_filter_ratio`). If the error rate of the imported data is below the threshold, the error rows are ignored and the other correct data is imported.
-
-The error rate is calculated as:
-
-```Plain
-#Filtered Rows / (#Filtered Rows + #Loaded Rows)
-```
-
-In other words, `Unselected Rows` are not included in the error rate calculation.
\ No newline at end of file
diff --git a/sidebars.json b/sidebars.json
index 72d6884ac48..8103dd5aea8 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -129,11 +129,11 @@
                             ]
                         },
                         "data-operate/import/load-json-format",
-                        "data-operate/import/migrate-data-from-other-olap",
-                        "data-operate/import/load-atomicity",
+                        "data-operate/import/error-data-handling",
                         "data-operate/import/load-data-convert",
                         "data-operate/import/min-load-replica-num",
-                        "data-operate/import/load-strict-mode"
+                        "data-operate/import/load-atomicity",
+                        "data-operate/import/migrate-data-from-other-olap"
                     ]
                 },
                 {


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

