This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 0a435c66ba [docs](update) Update Broker Load EN Version (#556)
0a435c66ba is described below

commit 0a435c66bae224e1c94d7ab8387bdc8a241dd5fb
Author: KassieZ <[email protected]>
AuthorDate: Tue Apr 16 11:19:12 2024 +0800

    [docs](update) Update Broker Load EN Version (#556)
---
 blog/release-note-2.1.2.md                         | 241 +++---
 .../data-operate/import/broker-load-manual.md      |   6 +-
 .../data-operate/import/broker-load-manual.md      | 921 +++++++++++++--------
 3 files changed, 723 insertions(+), 445 deletions(-)

diff --git a/blog/release-note-2.1.2.md b/blog/release-note-2.1.2.md
index 3733f4fc71..c05d034fa3 100644
--- a/blog/release-note-2.1.2.md
+++ b/blog/release-note-2.1.2.md
@@ -1,120 +1,121 @@
----
-{
-    'title': 'Apache Doris 2.1.2 just released',
-    'summary': 'Dear community, Apache Doris 2.1.2 has been officially 
released on April 12, 2024. This version submits several enhancements and bug 
fixes to further improve the performance and stability.',
-    'date': '2024-04-12',
-    'author': 'Apache Doris',
-    'tags': ['Release Notes'],
-    'picked': "true",
-    'order': "1",
-    "image": '/images/2.1.2.png'
-}
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-  http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-Dear community, Apache Doris 2.1.2 has been officially released on April 12, 
2024. This version submits several enhancements and bug fixes to further 
improve the performance and stability.
-
-**Quick Download:** https://doris.apache.org/download/
-
-**GitHub Release:** https://github.com/apache/doris/releases
-
-## Behavior Changed
-
-1. Set the default value of the `data_consistence` property of EXPORT to 
partition to make export more stable during load. 
-
-- https://github.com/apache/doris/pull/32830
-
-2. Some of MySQL Connector (eg, dotnet MySQL.Data) rely on variable's column 
type to make connection.
-
-   eg, select @[@autocommit]([@autocommit](https://github.com/autocommit)) 
should with column type BIGINT, not BIT, otherwise it will throw error. So we 
change column type of @[@autocommit](https://github.com/autocommit) to BIGINT. 
-
-- https://github.com/apache/doris/pull/33282
-
-
-## Upgrade Problem
-
-1. Normal workload group is not created when upgrade from 2.0 or other old 
versions. 
-
-  - https://github.com/apache/doris/pull/33197
-
-##  New Feature
-
-
-1. Add processlist table in information_schema database, users could use this 
table to query active connections. 
-
-  - https://github.com/apache/doris/pull/32511
-
-2. Add a new table valued function `LOCAL` to allow access file system like 
shared storage. 
-
-  - https://github.com/apache/doris-website/pull/494
-
-
-## Optimization
-
-1. Skip some useless process to make graceful stop more quickly in K8s env. 
-
-  - https://github.com/apache/doris/pull/33212
-
-2. Add rollup table name in profile to help find the mv selection problem. 
-
-  - https://github.com/apache/doris/pull/33137
-
-3. Add test connection function to DB2 database to allow user check the 
connection when create DB2 Catalog. 
-
-  - https://github.com/apache/doris/pull/33335
-
-4. Add DNS Cache for FQDN to accelerate the connect process among BEs in K8s 
env. 
-
-  - https://github.com/apache/doris/pull/32869
-
-5. Refresh external table's rowcount async to make the query plan more stable. 
-
-  - https://github.com/apache/doris/pull/32997
-
-
-## Bugfix
-
-
-1. Fix Iceberg Catalog of HMS and Hadoop do not support Iceberg properties 
like "io.manifest.cache-enabled" to enable manifest cache in Iceberg. 
-
-  - https://github.com/apache/doris/pull/33113
-
-2. The offset params in `LEAD`/`LAG` function could use 0 as offset. 
-
-  - https://github.com/apache/doris/pull/33174
-
-3. Fix some timeout issues with load. 
-
-  - https://github.com/apache/doris/pull/33077
-
-  - https://github.com/apache/doris/pull/33260
-
-4. Fix core problem related with `ARRAY`/`MAP`/`STRUCT` compaction process. 
-
-  - https://github.com/apache/doris/pull/33130
-
-  - https://github.com/apache/doris/pull/33295
-
-5. Fix runtime filter wait timeout. 
-
-  - https://github.com/apache/doris/pull/33369
-
-6. Fix `unix_timestamp` core for string input in auto partition. 
-
-  - https://github.com/apache/doris/pull/32871
+---
+{
+    'title': 'Apache Doris 2.1.2 just released',
+    'summary': 'Dear community, Apache Doris 2.1.2 has been officially 
released on April 12, 2024. This version submits several enhancements and bug 
fixes to further improve the performance and stability.',
+    'date': '2024-04-12',
+    'author': 'Apache Doris',
+    'tags': ['Release Notes'],
+    'picked': "true",
+    'order': "1",
+    "image": '/images/2.1.2.png'
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Dear community, Apache Doris 2.1.2 has been officially released on April 12, 
2024. This version submits several enhancements and bug fixes to further 
improve the performance and stability.
+
+**Quick Download:** https://doris.apache.org/download/
+
+**GitHub Release:** https://github.com/apache/doris/releases
+
+## Behavior Changed
+
+1. Set the default value of the `data_consistence` property of EXPORT to `partition`, making exports more stable while data is being loaded.
+
+- https://github.com/apache/doris/pull/32830
+
+2. Some MySQL connectors (e.g., the .NET MySQL.Data connector) rely on a variable's column type when establishing a connection.
+
+   For example, `SELECT @@autocommit` should return a column of type BIGINT, not BIT; otherwise the connector throws an error. Therefore, the column type of `@@autocommit` has been changed to BIGINT.
+
+- https://github.com/apache/doris/pull/33282
+
+
+## Upgrade Problem
+
+1. The `normal` workload group was not created when upgrading from 2.0 or other older versions.
+
+  - https://github.com/apache/doris/pull/33197
+
+## New Feature
+
+
+1. Added a `processlist` table to the `information_schema` database; users can query this table to view active connections.
+
+  - https://github.com/apache/doris/pull/32511
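+
+   For example, active connections could then be listed with a simple query (a sketch; the exact column set is defined by the new table):
+
+   ```sql
+   SELECT * FROM information_schema.processlist;
+   ```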
+
+2. Added a new table-valued function `LOCAL` to allow access to file systems such as shared storage.
+
+  - https://github.com/apache/doris-website/pull/494
+
+
+## Optimization
+
+1. Skip unnecessary steps so that graceful shutdown completes more quickly in Kubernetes environments. 
+
+  - https://github.com/apache/doris/pull/33212
+
+2. Added the rollup table name to the query profile to help diagnose materialized view selection issues. 
+
+  - https://github.com/apache/doris/pull/33137
+
+3. Added a test-connection function for DB2 so users can verify connectivity when creating a DB2 Catalog. 
+
+  - https://github.com/apache/doris/pull/33335
+
+4. Added a DNS cache for FQDNs to speed up connections between BEs in Kubernetes environments. 
+
+  - https://github.com/apache/doris/pull/32869
+
+5. Refresh external tables' row counts asynchronously to make query plans more stable. 
+
+  - https://github.com/apache/doris/pull/32997
+
+
+## Bugfix
+
+
+1. Fixed an issue where HMS and Hadoop Iceberg Catalogs did not support Iceberg properties such as "io.manifest.cache-enabled" for enabling the Iceberg manifest cache. 
+
+  - https://github.com/apache/doris/pull/33113
+
+2. The offset parameter of the `LEAD`/`LAG` functions can now be 0. 
+
+  - https://github.com/apache/doris/pull/33174
+
+3. Fix some timeout issues with load. 
+
+  - https://github.com/apache/doris/pull/33077
+
+  - https://github.com/apache/doris/pull/33260
+
+4. Fixed a core dump in the compaction process for `ARRAY`/`MAP`/`STRUCT` types. 
+
+  - https://github.com/apache/doris/pull/33130
+
+  - https://github.com/apache/doris/pull/33295
+
+5. Fix runtime filter wait timeout. 
+
+  - https://github.com/apache/doris/pull/33369
+
+6. Fixed a `unix_timestamp` core dump for string input with auto partition. 
+
+  - https://github.com/apache/doris/pull/32871
+
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/import/broker-load-manual.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/import/broker-load-manual.md
index 1e488a260c..be931f2056 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/import/broker-load-manual.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/import/broker-load-manual.md
@@ -28,7 +28,7 @@ under the License.
 
 Stream Load is a push-based method: the client reads the data to be imported and pushes it to Doris. Broker Load instead sends an import request to Doris, and Doris actively pulls the data itself, so Broker Load is the most convenient option when the data is stored in HDFS or object storage. The data does not need to pass through the client; Doris reads and imports it directly.
 
-Direct reads from HDFS or S3 can also be imported through the HDFS TVF or S3 TVF in Lakehouse/T VF. TVF-based Insert Into is currently a synchronous import, while Broker Load is an asynchronous import method.
+Direct reads from HDFS or S3 can also be imported through the HDFS TVF or S3 TVF in [Lakehouse/TVF](../../lakehouse/file). TVF-based Insert Into is currently a synchronous import, while Broker Load is an asynchronous import method.
 
 Broker Load is suitable for scenarios where the source data is stored in a remote storage system, such as HDFS, and the data volume is relatively large.
 
@@ -591,7 +591,9 @@ WITH BROKER "broker_name"
 
 Users typically specify an existing broker name through the `WITH BROKER "broker_name"` clause of the load command. The broker name is the name given when Broker processes are added with the `ALTER SYSTEM ADD BROKER` command; one name usually corresponds to one or more Broker processes, and Doris selects an available Broker process by that name. The Brokers that already exist in the cluster can be viewed with the `SHOW BROKER` command.
 
-Note: the broker name is only a user-defined name and does not represent the type of the Broker.
+:::info Note
+The broker name is only a user-defined name and does not represent the type of the Broker.
+:::
 
 **Authentication Information**
 
diff --git 
a/versioned_docs/version-2.0/data-operate/import/broker-load-manual.md 
b/versioned_docs/version-2.0/data-operate/import/broker-load-manual.md
index e684397c6d..671c09f6c5 100644
--- a/versioned_docs/version-2.0/data-operate/import/broker-load-manual.md
+++ b/versioned_docs/version-2.0/data-operate/import/broker-load-manual.md
@@ -24,253 +24,53 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Broker Load
+## Why introduce Broker Load?
 
-Broker load is an asynchronous import method, and the supported data sources 
depend on the data sources supported by the 
[Broker](../../../advanced/broker.md) process.
+Stream Load is a push-based method, where the data to be imported relies on 
the client to read and push it to Doris. Broker Load, on the other hand, 
involves sending an import request to Doris, and Doris actively pulls the data. 
Therefore, if the data is stored in systems like HDFS or object storage, using 
Broker Load is the most convenient. This way, the data doesn't need to pass 
through the client but is directly read and imported by Doris.
 
-Because the data in the Doris table is ordered, Broker load uses the doris 
cluster resources to sort the data when importing data. Complete massive 
historical data migration relative to Spark load, the Doris cluster resource 
usage is relatively large. , this method is used when the user does not have 
Spark computing resources. If there are Spark computing resources, it is 
recommended to use [Spark 
load](../../../data-operate/import/import-way/spark-load-manual.md).
+Data stored in HDFS or S3 can also be imported directly through the HDFS TVF or S3 TVF in [Lakehouse/TVF](../../lakehouse/file). TVF-based Insert Into is currently a synchronous import method, while Broker Load is an asynchronous one.
 
-Users need to create [Broker 
load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD)
 import through MySQL protocol and import by viewing command to check the 
import result.
+Broker Load is suitable for scenarios where the source data is stored in 
remote storage systems, such as HDFS, and the data volume is relatively large.
 
-## Applicable scene
+## Basic Principles
 
-* The source data is in a storage system that the broker can access, such as 
HDFS.
-* The amount of data is at the level of tens to hundreds of GB.
+After a user submits an import task, the Frontend (FE) generates a 
corresponding plan. Based on the current number of Backend (BE) nodes and the 
size of the file, the plan is distributed to multiple BE nodes for execution, 
with each BE node handling a portion of the import data.
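+The split across BE nodes can be sketched with the concurrency formula documented for earlier versions of this manual (a rough model, not Doris source code; it assumes the default FE settings `min_bytes_per_broker_scanner` = 64 MB and `max_broker_concurrency` = 10):
+
```python
# Rough model of how many concurrent instances an import job gets.
# Assumed FE defaults: min_bytes_per_broker_scanner = 64 MB,
# max_broker_concurrency = 10.
MIN_BYTES_PER_SCANNER = 64 * 1024 * 1024
MAX_BROKER_CONCURRENCY = 10

def plan_concurrency(source_bytes: int, be_count: int) -> tuple:
    """Return (concurrency, approx. bytes handled per BE) for one job."""
    concurrency = min(
        max(source_bytes // MIN_BYTES_PER_SCANNER, 1),  # at least one instance
        MAX_BROKER_CONCURRENCY,
        be_count,
    )
    return concurrency, source_bytes // concurrency

# A 1 GB source file on a 3-BE cluster is split across all 3 BE nodes.
print(plan_concurrency(1 * 1024**3, 3))  # (3, 357913941)
```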
 
-## Fundamental
+During execution, the BE nodes pull data from the Broker, perform necessary 
transformations, and then import the data into the system. Once all BE nodes 
have completed the import, the FE makes the final determination on whether the 
import was successful.
 
-After the user submits the import task, FE will generate the corresponding 
Plan and distribute the Plan to multiple BEs for execution according to the 
current number of BEs and file size, and each BE executes a part of the 
imported data.
+![Broker Load](/images/broker-load.png)
 
-BE pulls data from the broker during execution, and imports the data into the 
system after transforming the data. All BEs are imported, and FE ultimately 
decides whether the import is successful.
 
-```
-                 +
-                 | 1. user create broker load
-                 v
-            +----+----+
-            |         |
-            |   FE    |
-            |         |
-            +----+----+
-                 |
-                 | 2. BE etl and load the data
-    +--------------------------+
-    |            |             |
-+---v---+     +--v----+    +---v---+
-|       |     |       |    |       |
-|  BE   |     |  BE   |    |   BE  |
-|       |     |       |    |       |
-+---+-^-+     +---+-^-+    +--+-^--+
-    | |           | |         | |
-    | |           | |         | | 3. pull data from broker
-+---v-+-+     +---v-+-+    +--v-+--+
-|       |     |       |    |       |
-|Broker |     |Broker |    |Broker |
-|       |     |       |    |       |
-+---+-^-+     +---+-^-+    +---+-^-+
-    | |           | |          | |
-+---v-+-----------v-+----------v-+-+
-|       HDFS/BOS/AFS cluster       |
-|                                  |
-+----------------------------------+
-
-```
-
-## start import
-
-Let's look at [Broker 
Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD)
 through several actual scenario examples. use
-
-### Data import of Hive partition table
-
-1. Create Hive table
-
-```sql
-##Data format is: default, partition field is: day
-CREATE TABLE `ods_demo_detail`(
-  `id` string,
-  `store_id` string,
-  `company_id` string,
-  `tower_id` string,
-  `commodity_id` string,
-  `commodity_name` string,
-  `commodity_price` double,
-  `member_price` double,
-  `cost_price` double,
-  `unit` string,
-  `quantity` double,
-  `actual_price` double
-)
-PARTITIONED BY (day string)
-row format delimited fields terminated by ','
-lines terminated by '\n'
-````
-
-Then use Hive's Load command to import your data into the Hive table
-
-````
-load data local inpath '/opt/custorm' into table ods_demo_detail;
-````
-
-2. Create a Doris table, refer to the specific table syntax: [CREATE 
TABLE](../../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE)
-
-````
-CREATE TABLE `doris_ods_test_detail` (
-  `rq` date NULL,
-  `id` varchar(32) NOT NULL,
-  `store_id` varchar(32) NULL,
-  `company_id` varchar(32) NULL,
-  `tower_id` varchar(32) NULL,
-  `commodity_id` varchar(32) NULL,
-  `commodity_name` varchar(500) NULL,
-  `commodity_price` decimal(10, 2) NULL,
-  `member_price` decimal(10, 2) NULL,
-  `cost_price` decimal(10, 2) NULL,
-  `unit` varchar(50) NULL,
-  `quantity` int(11) NULL,
-  `actual_price` decimal(10, 2) NULL
-) ENGINE=OLAP
-UNIQUE KEY(`rq`, `id`, `store_id`)
-PARTITION BY RANGE(`rq`)
-(
-PARTITION P_202204 VALUES [('2022-04-01'), ('2022-05-01')))
-DISTRIBUTED BY HASH(`store_id`) BUCKETS 1
-PROPERTIES (
-"replication_allocation" = "tag.location.default: 3",
-"dynamic_partition.enable" = "true",
-"dynamic_partition.time_unit" = "MONTH",
-"dynamic_partition.start" = "-2147483648",
-"dynamic_partition.end" = "2",
-"dynamic_partition.prefix" = "P_",
-"dynamic_partition.buckets" = "1",
-"in_memory" = "false",
-"storage_format" = "V2"
-);
-````
+As seen in the diagram, BE nodes rely on Broker processes to read data from 
corresponding remote storage systems. The introduction of Broker processes 
primarily aims to accommodate different remote storage systems. Users can 
develop their own Broker processes according to established standards. These 
Broker processes, which can be developed using Java, offer better compatibility 
with various storage systems in the big data ecosystem. The separation of 
Broker processes from BE nodes ensur [...]
 
-3. Start importing data
+Currently, BE nodes have built-in support for HDFS and S3 Brokers. Therefore, 
when importing data from HDFS or S3, there is no need to additionally start a 
Broker process. However, if a customized Broker implementation is required, the 
corresponding Broker process needs to be deployed.
 
-   Specific syntax reference: [Broker 
Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD)
+## SQL syntax for importing
 
 ```sql
-LOAD LABEL broker_load_2022_03_23
+LOAD LABEL load_label
 (
-    DATA 
INFILE("hdfs://192.168.20.123:8020/user/hive/warehouse/ods.db/ods_demo_detail/*/*")
-    INTO TABLE doris_ods_test_detail
-    COLUMNS TERMINATED BY ","
-  
(id,store_id,company_id,tower_id,commodity_id,commodity_name,commodity_price,member_price,cost_price,unit,quantity,actual_price)
-    COLUMNS FROM PATH AS (`day`)
-   SET
-   (rq = 
str_to_date(`day`,'%Y-%m-%d'),id=id,store_id=store_id,company_id=company_id,tower_id=tower_id,commodity_id=commodity_id,commodity_name=commodity_name,commodity_price=commodity_price,member_price
 
=member_price,cost_price=cost_price,unit=unit,quantity=quantity,actual_price=actual_price)
+data_desc1[, data_desc2, ...]
 )
-WITH BROKER "broker_name_1"
-(
-"username" = "hdfs",
-"password" = ""
-)
-PROPERTIES
-(
-    "timeout"="1200",
-    "max_filter_ratio"="0.1"
-);
-````
+WITH [HDFS|S3|BROKER broker_name] 
+[broker_properties]
+[load_properties]
+[COMMENT "comments"];
+```
 
-### Hive partition table import (ORC format)
-1. Create Hive partition table, ORC format
For the specific usage syntax, please refer to [BROKER LOAD](../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD) in the SQL manual.
 
-```sql
-#Data format: ORC partition: day
-CREATE TABLE `ods_demo_orc_detail`(
-  `id` string,
-  `store_id` string,
-  `company_id` string,
-  `tower_id` string,
-  `commodity_id` string,
-  `commodity_name` string,
-  `commodity_price` double,
-  `member_price` double,
-  `cost_price` double,
-  `unit` string,
-  `quantity` double,
-  `actual_price` double
-)
-PARTITIONED BY (day string)
-row format delimited fields terminated by ','
-lines terminated by '\n'
-STORED AS ORC
-````
-
-2. Create a Doris table. The table creation statement here is the same as the 
Doris table creation statement above. Please refer to the above .
-
-3. Import data using Broker Load
-
-   ```sql
-   LOAD LABEL dish_2022_03_23
-   (
-       DATA 
INFILE("hdfs://10.220.147.151:8020/user/hive/warehouse/ods.db/ods_demo_orc_detail/*/*")
-       INTO TABLE doris_ods_test_detail
-       COLUMNS TERMINATED BY ","
-       FORMAT AS "orc"
-   
(id,store_id,company_id,tower_id,commodity_id,commodity_name,commodity_price,member_price,cost_price,unit,quantity,actual_price)
-       COLUMNS FROM PATH AS (`day`)
-      SET
-      (rq = 
str_to_date(`day`,'%Y-%m-%d'),id=id,store_id=store_id,company_id=company_id,tower_id=tower_id,commodity_id=commodity_id,commodity_name=commodity_name,commodity_price=commodity_price,member_price
 
=member_price,cost_price=cost_price,unit=unit,quantity=quantity,actual_price=actual_price)
-   )
-   WITH BROKER "broker_name_1"
-   (
-   "username" = "hdfs",
-   "password" = ""
-   )
-   PROPERTIES
-   (
-       "timeout"="1200",
-       "max_filter_ratio"="0.1"
-   );
-   ````
-
-   **Notice:**
-
-   - `FORMAT AS "orc"` : here we specify the data format to import
-   - `SET` : Here we define the field mapping relationship between the Hive 
table and the Doris table and some operations for field conversion
-
-### HDFS file system data import
-
-Let's continue to take the Doris table created above as an example to 
demonstrate importing data from HDFS through Broker Load.
-
-The statement to import the job is as follows:
+## Checking import status
 
-```sql
-LOAD LABEL demo.label_20220402
-        (
-            DATA INFILE("hdfs://10.220.147.151:8020/tmp/test_hdfs.txt")
-            INTO TABLE `ods_dish_detail_test`
-            COLUMNS TERMINATED BY "\t" 
(id,store_id,company_id,tower_id,commodity_id,commodity_name,commodity_price,member_price,cost_price,unit,quantity,actual_price)
-        )
-        with HDFS (
-            "fs.defaultFS"="hdfs://10.220.147.151:8020",
-            "hadoop.username"="root"
-        )
-        PROPERTIES
-        (
-            "timeout"="1200",
-            "max_filter_ratio"="0.1"
-        );
-````
-
-The specific parameters here can refer to: [Broker](../../../advanced/broker) 
and [Broker Load](../../../sql-manual/sql-reference-v2 
/Data-Manipulation-Statements/Load/BROKER-LOAD) documentation
-
-## View import status
-
-We can view the status information of the above import task through the 
following command,
-
-The specific syntax reference for viewing the import status [SHOW 
LOAD](../../../sql-manual/sql-reference/Show-Statements/SHOW-LOAD)
+Broker Load is an asynchronous import method, and the specific import results 
can be viewed through the [SHOW 
LOAD](../../sql-manual/sql-reference/Show-Statements/SHOW-LOAD) command.
 
-```sql
+```Plain
 mysql> show load order by createtime desc limit 1\G;
-**************************** 1. row ******************** ******
+*************************** 1. row ***************************
          JobId: 41326624
-         Label: broker_load_2022_03_23
+         Label: broker_load_2022_04_15
          State: FINISHED
-      Progress: ETL: 100%; LOAD: 100%
+      Progress: ETL:100%; LOAD:100%
           Type: BROKER
        EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=27
       TaskInfo: cluster:N/A; timeout(s):1200; max_filter_ratio:0.1
@@ -281,165 +81,640 @@ mysql> show load order by createtime desc limit 1\G;
  LoadStartTime: 2022-04-01 18:59:11
 LoadFinishTime: 2022-04-01 18:59:11
            URL: NULL
-    JobDetails: {"Unfinished 
backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[]},"ScannedRows":27,"TaskNumber":1,"All
 backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[36728051]},"FileNumber 
":1,"FileSize":5540}
+    JobDetails: {"Unfinished 
backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[]},"ScannedRows":27,"TaskNumber":1,"All
 
backends":{"5072bde59b74b65-8d2c0ee5b029adc0":[36728051]},"FileNumber":1,"FileSize":5540}
 1 row in set (0.01 sec)
-````
+```
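+The `EtlInfo` field above encodes the row counts for the job. A small sketch (a hypothetical helper, not a Doris API) that parses it and checks the filtered-row ratio against this job's `max_filter_ratio` of 0.1:
+
```python
def parse_etl_info(etl_info: str) -> dict:
    """Parse an EtlInfo string such as
    'unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=27'."""
    return {k.strip(): int(v) for k, v in
            (item.split("=") for item in etl_info.split(";"))}

info = parse_etl_info("unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=27")
# dpp.abnorm.ALL counts rows filtered out for data-quality reasons,
# dpp.norm.ALL counts rows imported correctly.
filtered_ratio = info["dpp.abnorm.ALL"] / (info["dpp.abnorm.ALL"] + info["dpp.norm.ALL"])
print(filtered_ratio <= 0.1)  # True: the job stayed within max_filter_ratio
```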
 
-## Cancel import
+## Cancelling an Import
 
-When the broker load job status is not CANCELLED or FINISHED, it can be 
manually canceled by the user. When canceling, you need to specify the Label of 
the import task to be canceled. Cancel the import command syntax to execute 
[CANCEL 
LOAD](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/CANCEL-LOAD)
 view.
+When the status of a Broker Load job is not CANCELLED or FINISHED, it can be 
manually cancelled by the user. To cancel, the user needs to specify the label 
of the import task to be cancelled. The syntax for the cancel import command 
can be viewed by executing [CANCEL 
LOAD](../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/CANCEL-LOAD).
 
-For example: cancel the import job with the label broker_load_2022_03_23 on 
the database demo
+For example: To cancel the import job with the label "broker_load_2022_03_23" 
on the DEMO database.
 
 ```sql
 CANCEL LOAD FROM demo WHERE LABEL = "broker_load_2022_03_23";
-````
-## Relevant system configuration
+```
 
-### Broker parameters
+## HDFS Load
 
-Broker Load needs to use the Broker process to access remote storage. 
Different brokers need to provide different parameters. For details, please 
refer to [Broker documentation](../../../advanced/broker).
+### Simple Authentication
 
-### FE configuration
+Simple authentication refers to a Hadoop configuration in which hadoop.security.authentication is set to "simple".
+
+```Plain
+(
+    "username" = "user",
+    "password" = ""
+);
+```
+
+The username should be configured as the user to be accessed, and the password 
can be left blank.
+
+### Kerberos Authentication
+
+This authentication method requires the following information:
+
+- **hadoop.security.authentication:** Specifies the authentication method as 
Kerberos.
+
+- **hadoop.kerberos.principal:** Specifies the Kerberos principal.
+
+- **hadoop.kerberos.keytab:** Specifies the file path of the Kerberos keytab. 
The file must be an absolute path on the server where the Broker process is 
located and must be accessible by the Broker process.
+
+- **kerberos_keytab_content:** Specifies the base64-encoded content of the Kerberos keytab file. This can be used as an alternative to the hadoop.kerberos.keytab configuration.
+
+Example configuration:
+
+```Plain
+(
+    "hadoop.security.authentication" = "kerberos",
+    "hadoop.kerberos.principal" = "[email protected]",
+    "hadoop.kerberos.keytab" = "/home/doris/my.keytab"
+)
+(
+    "hadoop.security.authentication" = "kerberos",
+    "hadoop.kerberos.principal" = "[email protected]",
+    "kerberos_keytab_content" = "ASDOWHDLAWIDJHWLDKSALDJSDIWALD"
+)
+```
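+The value for `kerberos_keytab_content` is simply the keytab file's raw bytes encoded in base64. A minimal sketch of producing it (the keytab path is illustrative):
+
```python
import base64

def keytab_to_b64(keytab_bytes: bytes) -> str:
    """Return the base64 string to paste into kerberos_keytab_content."""
    return base64.b64encode(keytab_bytes).decode("ascii")

# In practice: keytab_to_b64(open("/home/doris/my.keytab", "rb").read())
# Keytab files begin with the 0x0502 format-version magic:
print(keytab_to_b64(b"\x05\x02"))  # BQI=
```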
+
+To use Kerberos authentication, the [krb5.conf](https://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html) file is required. The krb5.conf file contains Kerberos configuration information. Typically, it should be installed in the /etc directory; you can override the default location by setting the KRB5_CONFIG environment variable. An example of the krb5.conf file content is as follows:
+
+```Plain
+[libdefaults]
+    default_realm = DORIS.HADOOP
+    default_tkt_enctypes = des3-hmac-sha1 des-cbc-crc
+    default_tgs_enctypes = des3-hmac-sha1 des-cbc-crc
+    dns_lookup_kdc = true
+    dns_lookup_realm = false
+
+[realms]
+    DORIS.HADOOP = {
+        kdc = kerberos-doris.hadoop.service:7005
+    }
+```
 
-The following configurations belong to the system-level configuration of 
Broker load, that is, the configurations that apply to all Broker load import 
tasks. The configuration values are adjusted mainly by modifying `fe.conf`.
+### HDFS HA Mode
 
-- 
min_bytes_per_broker_scanner/max_bytes_per_broker_scanner/max_broker_concurrency
+This configuration is used to access HDFS clusters deployed in HA (High 
Availability) mode.
 
-  The first two configurations limit the minimum and maximum amount of data 
processed by a single BE. The third configuration limits the maximum number of 
concurrent imports for a job. The minimum amount of data processed, the maximum 
number of concurrency, the size of the source file and the number of BEs in the 
current cluster ** together determine the number of concurrent imports**.
+- **dfs.nameservices:** Specifies the name of the HDFS service, which can be 
customized. For example: "dfs.nameservices" = "my_ha".
 
-  ````text
-  The number of concurrent imports this time = Math.min (source file 
size/minimum processing capacity, maximum concurrent number, current number of 
BE nodes)
-  The processing volume of a single BE imported this time = the size of the 
source file / the number of concurrent imports this time
-  ````
+- **dfs.ha.namenodes.xxx:** Customizes the names of the namenodes, with 
multiple names separated by commas. Here, xxx represents the custom name 
specified in dfs.nameservices. For example: "dfs.ha.namenodes.my_ha" = "my_nn".
 
-  Usually the maximum amount of data supported by an import job is 
`max_bytes_per_broker_scanner * number of BE nodes`. If you need to import a 
larger amount of data, you need to adjust the size of the 
`max_bytes_per_broker_scanner` parameter appropriately.
+- **dfs.namenode.rpc-address.xxx.nn:** Specifies the RPC address information 
for the namenode. In this context, nn represents the namenode name configured 
in dfs.ha.namenodes.xxx. For example: "dfs.namenode.rpc-address.my_ha.my_nn" = 
"host:port".
 
-  default allocation:
+- **dfs.client.failover.proxy.provider:** Specifies the provider for client 
connections to the namenode. The default is 
org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
 
-  ````text
-  Parameter name: min_bytes_per_broker_scanner, the default is 64MB, the unit 
is bytes.
-  Parameter name: max_broker_concurrency, default 10.
-  Parameter name: max_bytes_per_broker_scanner, the default is 500G, the unit 
is bytes.
-  ````
+An example configuration is as follows:
 
-## Best Practices
+```Plain
+(
+    "fs.defaultFS" = "hdfs://my_ha",
+    "dfs.nameservices" = "my_ha",
+    "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
+    "dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
+    "dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
+    "dfs.client.failover.proxy.provider" = 
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
+)
+```
 
-### Application scenarios
+HA mode can be combined with the previous two authentication methods for 
cluster access. For example, accessing HA HDFS through simple authentication:
 
-The most suitable scenario for using Broker load is the scenario where the 
original data is in the file system (HDFS, BOS, AFS). Secondly, since Broker 
load is the only way of asynchronous import in a single import, if users need 
to use asynchronous access when importing large files, they can also consider 
using Broker load.
+```Plain
+(
+    "username"="user",
+    "password"="passwd",
+    "fs.defaultFS" = "hdfs://my_ha",
+    "dfs.nameservices" = "my_ha",
+    "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
+    "dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
+    "dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
+    "dfs.client.failover.proxy.provider" = 
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
+)
+```
 
-### The amount of data
+### Import Example
 
-Only the case of a single BE is discussed here. If the user cluster has 
multiple BEs, the amount of data in the title below should be multiplied by the 
number of BEs. For example: if the user has 3 BEs, the value below 3G 
(inclusive) should be multiplied by 3, that is, below 9G (inclusive).
+- Importing TXT Files from HDFS
 
-- Below 3G (included)
+  ```sql
+  LOAD LABEL demo.label_20220402
+  (
+      DATA INFILE("hdfs://host:port/tmp/test_hdfs.txt")
+      INTO TABLE `load_hdfs_file_test`
+      COLUMNS TERMINATED BY "\t"            
+      (id,age,name)
+  ) 
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  )
+  PROPERTIES
+  (
+      "timeout"="1200",
+      "max_filter_ratio"="0.1"
+  );
+  ```
 
-  Users can directly submit Broker load to create import requests.
+- Importing TXT files from an HDFS cluster with NameNode HA (High Availability) configured
 
-- Above 3G
+  ```sql
+  LOAD LABEL demo.label_20220402
+  (
+      DATA INFILE("hdfs://hafs/tmp/test_hdfs.txt")
+      INTO TABLE `load_hdfs_file_test`
+      COLUMNS TERMINATED BY "\t"            
+      (id,age,name)
+  ) 
+  with HDFS
+  (
+      "hadoop.username" = "user",
+      "fs.defaultFS"="hdfs://hafs",
+      "dfs.nameservices" = "hafs",
+      "dfs.ha.namenodes.hafs" = "my_namenode1, my_namenode2",
+      "dfs.namenode.rpc-address.hafs.my_namenode1" = "nn1_host:rpc_port",
+      "dfs.namenode.rpc-address.hafs.my_namenode2" = "nn2_host:rpc_port",
+      "dfs.client.failover.proxy.provider.hafs" = 
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
+  )
+  PROPERTIES
+  (
+      "timeout"="1200",
+      "max_filter_ratio"="0.1"
+  );
+  ```
 
-  Since the maximum processing capacity of a single import BE is 3G, the 
import of files exceeding 3G needs to be adjusted by adjusting the import 
parameters of Broker load to realize the import of large files.
+- Using wildcards to match two batches of files on HDFS and importing them 
into two separate tables
 
-  1. Modify the maximum scan amount and maximum concurrent number of a single 
BE according to the current number of BEs and the size of the original file.
+  ```sql
+  LOAD LABEL example_db.label2
+  (
+      DATA INFILE("hdfs://host:port/input/file-10*")
+      INTO TABLE `my_table1`
+      PARTITION (p1)
+      COLUMNS TERMINATED BY ","
+      (k1, tmp_k2, tmp_k3)
+      SET (
+          k2 = tmp_k2 + 1,
+          k3 = tmp_k3 + 1
+      )
+      DATA INFILE("hdfs://host:port/input/file-20*")
+      INTO TABLE `my_table2`
+      COLUMNS TERMINATED BY ","
+      (k1, k2, k3)
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-     ````text
-     Modify the configuration in fe.conf
-     max_broker_concurrency = number of BEs
-     The amount of data processed by a single BE of the current import task = 
original file size / max_broker_concurrency
-     max_bytes_per_broker_scanner >= the amount of data processed by a single 
BE of the current import task
-     
-     For example, for a 100G file, the number of BEs in the cluster is 10
-     max_broker_concurrency = 10
-     # >= 10G = 100G / 10
-     max_bytes_per_broker_scanner = 1069547520
-     ````
+This imports two batches of files matching the wildcards `file-10*` and 
`file-20*` from HDFS into two separate tables, `my_table1` and `my_table2`. 
Here, `my_table1` specifies that the data should be imported into partition 
`p1`, and the values in the second and third columns of the source files are 
incremented by 1 before being imported.
 
-     After modification, all BEs will process the import task concurrently, 
each BE processing part of the original file.
+- Import a batch of data from HDFS using wildcards
 
-     *Note: The configurations in the above two FEs are all system 
configurations, that is to say, their modifications are applied to all Broker 
load tasks. *
+  ```sql
+  LOAD LABEL example_db.label3
+  (
+      DATA INFILE("hdfs://host:port/user/doris/data/*/*")
+      INTO TABLE `my_table`
+      COLUMNS TERMINATED BY "\\x01"
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-  2. Customize the timeout time of the current import task when creating an 
import
+This specifies the delimiter as `\x01`, the default delimiter commonly used 
by Hive, and uses the wildcard `*` to match all files in all directories under 
the `data` directory.
 
-     ````text
-     The amount of data processed by a single BE of the current import task / 
the slowest import speed of the user Doris cluster (MB/s) >= the timeout time 
of the current import task >= the amount of data processed by a single BE of 
the current import task / 10M/s
-     
-     For example, for a 100G file, the number of BEs in the cluster is 10
-     # >= 1000s = 10G / 10M/s
-     timeout = 1000
-     ````
+- Import Parquet format data and specify the FORMAT as `parquet`
 
-  3. When the user finds that the timeout time calculated in the second step 
exceeds the default import timeout time of 4 hours
+  ```sql
+  LOAD LABEL example_db.label4
+  (
+      DATA INFILE("hdfs://host:port/input/file")
+      INTO TABLE `my_table`
+      FORMAT AS "parquet"
+      (k1, k2, k3)
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-     At this time, it is not recommended for users to directly increase the 
maximum import timeout to solve the problem. If the single import time exceeds 
the default import maximum timeout time of 4 hours, it is best to divide the 
files to be imported and import them in multiple times to solve the problem. 
The main reason is: if a single import exceeds 4 hours, the time cost of 
retrying after the import fails is very high.
+By default, the file format is determined by the file extension.
 
-     The expected maximum import file data volume of the Doris cluster can be 
calculated by the following formula:
+- Import the data and extract the partition field from the file path
 
-     ````text
-     Expected maximum import file data volume = 14400s * 10M/s * number of BEs
-     For example: the number of BEs in the cluster is 10
-     Expected maximum import file data volume = 14400s * 10M/s * 10 = 1440000M 
≈ 1440G
-     
-     Note: The average user's environment may not reach the speed of 10M/s, so 
it is recommended that files over 500G be divided and imported.
-     ````
+  ```sql
+  LOAD LABEL example_db.label10
+  (
+      DATA INFILE("hdfs://host:port/input/city=beijing/*/*")
+      INTO TABLE `my_table`
+      FORMAT AS "csv"
+      (k1, k2, k3)
+      COLUMNS FROM PATH AS (city, utc_date)
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-### Job scheduling
+The columns in `my_table` are `k1`, `k2`, `k3`, `city`, and `utc_date`.
 
-The system limits the number of running Broker Load jobs in a cluster to 
prevent too many Load jobs from running at the same time.
+The directory `hdfs://hdfs_host:hdfs_port/input/` contains the following 
files:
 
-First, the configuration parameter of FE: `desired_max_waiting_jobs` will 
limit the number of Broker Load jobs that have not started or are running (job 
status is PENDING or LOADING) in a cluster. Default is 100. If this threshold 
is exceeded, newly submitted jobs will be rejected outright.
+```Plain
+hdfs://hdfs_host:hdfs_port/input/city=beijing/utc_date=2020-10-01/0000.csv
+hdfs://hdfs_host:hdfs_port/input/city=beijing/utc_date=2020-10-02/0000.csv
+hdfs://hdfs_host:hdfs_port/input/city=tianji/utc_date=2020-10-03/0000.csv
+hdfs://hdfs_host:hdfs_port/input/city=tianji/utc_date=2020-10-04/0000.csv
+```
 
-A Broker Load job is divided into pending task and loading task phases. Among 
them, the pending task is responsible for obtaining the information of the 
imported file, and the loading task will be sent to the BE to execute the 
specific import task.
+The files contain only three columns of data: `k1`, `k2`, and `k3`. The other 
two columns, `city` and `utc_date`, will be extracted from the file path.
 
-The FE configuration parameter `async_pending_load_task_pool_size` is used to 
limit the number of pending tasks running at the same time. It is also 
equivalent to controlling the number of import tasks that are actually running. 
This parameter defaults to 10. That is to say, assuming that the user submits 
100 Load jobs, at the same time only 10 jobs will enter the LOADING state and 
start execution, while other jobs are in the PENDING waiting state.
+- Filter the imported data
 
-The configuration parameter `async_loading_load_task_pool_size` of FE is used 
to limit the number of tasks of loading tasks running at the same time. A 
Broker Load job will have one pending task and multiple loading tasks (equal to 
the number of DATA INFILE clauses in the LOAD statement). So 
`async_loading_load_task_pool_size` should be greater than or equal to 
`async_pending_load_task_pool_size`.
+  ```sql
+  LOAD LABEL example_db.label6
+  (
+      DATA INFILE("hdfs://host:port/input/file")
+      INTO TABLE `my_table`
+      (k1, k2, k3)
+      SET (
+          k2 = k2 + 1
+      )
+      PRECEDING FILTER k1 = 1
+      WHERE k1 > k2
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-### Performance Analysis
+Only the rows where `k1 = 1` in the original data and where `k1 > k2` after 
transformation will be imported.
 
-Session variables can be enabled by executing `set enable_profile=true` before 
submitting the LOAD job. Then submit the import job. After the import job is 
completed, you can view the profile of the import job in the `Queris` tab of 
the FE web page.
+- Import data and extract the time partition field from the file path.
 
-You can check the [SHOW LOAD 
PROFILE](../../../sql-manual/sql-reference/Show-Statements/SHOW-LOAD-PROFILE) 
help document for more usage help information.
+  ```sql
+  LOAD LABEL example_db.label7
+  (
+      DATA INFILE("hdfs://host:port/user/data/*/test.txt") 
+      INTO TABLE `tbl12`
+      COLUMNS TERMINATED BY ","
+      (k2,k3)
+      COLUMNS FROM PATH AS (data_time)
+      SET (
+          data_time=str_to_date(data_time, '%Y-%m-%d %H%%3A%i%%3A%s')
+      )
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username" = "user"
+  );
+  ```
 
-This Profile can help analyze the running status of import jobs.
+:::tip Tip
+The time values in the paths contain `%3A` because colons (`:`) are not 
allowed in HDFS paths; each colon is percent-encoded as `%3A`.
+:::
 
-Currently the Profile can only be viewed after the job has been successfully 
executed
+There are the following files under the path:
 
-## common problem
+```Plain
+/user/data/data_time=2020-02-17 00%3A00%3A00/test.txt
+/user/data/data_time=2020-02-18 00%3A00%3A00/test.txt
+```
 
-- Import error: `Scan bytes per broker scanner exceed limit:xxx`
+The table structure is as follows:
 
-  Please refer to the Best Practices section in the document to modify the FE 
configuration items `max_bytes_per_broker_scanner` and `max_broker_concurrency`
+```Plain
+data_time DATETIME,
+k2        INT,
+k3        INT
+```
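The `%3A` sequences in these paths are ordinary percent-encoded colons, which is why the `str_to_date` format string above escapes them as `%%3A`. A minimal sketch of the round trip, using only Python's standard `urllib.parse` module (this is an illustration, not part of Doris):

```python
from urllib.parse import quote, unquote

# Colons are not allowed in HDFS path components, so each ":" in the time
# value is percent-encoded before it is written into the directory name.
raw = "2020-02-17 00:00:00"
encoded = quote(raw, safe=" ")   # keep spaces literal, encode ":" as %3A
print(encoded)                   # 2020-02-17 00%3A00%3A00

# Decoding restores the original time value, which is what the
# str_to_date(..., '%Y-%m-%d %H%%3A%i%%3A%s') mapping effectively parses.
assert unquote(encoded) == raw
```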
 
-- Import error: `failed to send batch` or `TabletWriter add batch with unknown 
id`
+- Use Merge mode for import
 
-  Modify `query_timeout` and `streaming_load_rpc_max_alive_time_sec` 
appropriately.
+  ```sql
+  LOAD LABEL example_db.label8
+  (
+      MERGE DATA INFILE("hdfs://host:port/input/file")
+      INTO TABLE `my_table`
+      (k1, k2, k3, v2, v1)
+      DELETE ON v2 > 100
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username"="user"
+  )
+  PROPERTIES
+  (
+      "timeout" = "3600",
+      "max_filter_ratio" = "0.1"
+  );
+  ```
 
-  streaming_load_rpc_max_alive_time_sec:
+To use Merge mode for import, `my_table` must be a Unique Key model table. 
When the value of the `v2` column in the imported data is greater than 100, 
the row is treated as a delete row. The import task timeout is 3600 seconds, 
and an error rate of up to 10% is allowed.
 
-  During the import process, Doris will open a Writer for each Tablet to 
receive data and write. This parameter specifies the Writer's wait timeout. If 
the Writer does not receive any data within this time, the Writer will be 
automatically destroyed. When the system processing speed is slow, the Writer 
may not receive the next batch of data for a long time, resulting in an import 
error: `TabletWriter add batch with unknown id`. At this time, this 
configuration can be appropriately increa [...]
+- Specify the `source_sequence` column during import to ensure the order of 
replacements.
 
-- Import error: `LOAD_RUN_FAIL; msg:Invalid Column Name:xxx`
+  ```sql
+  LOAD LABEL example_db.label9
+  (
+      DATA INFILE("hdfs://host:port/input/file")
+      INTO TABLE `my_table`
+      COLUMNS TERMINATED BY ","
+      (k1,k2,source_sequence,v1,v2)
+      ORDER BY source_sequence
+  ) 
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username"="user"
+  );
+  ```
+
+The `my_table` must be a Unique Key model table with a specified sequence 
column. The data will maintain its order based on the values in the 
`source_sequence` column in the source data.
 
-  If it is data in PARQUET or ORC format, the column name of the file header 
needs to be consistent with the column name in the doris table, such as:
+- Import data in `json` format, and specify `json_root` and `jsonpaths` 
accordingly.
 
-  ````text
-  (tmp_c1,tmp_c2)
-  SET
+  ```sql
+  LOAD LABEL example_db.label10
   (
-      id=tmp_c2,
-      name=tmp_c1
+      DATA INFILE("hdfs://host:port/input/file.json")
+      INTO TABLE `my_table`
+      FORMAT AS "json"
+      PROPERTIES(
+        "json_root" = "$.item",
+        "jsonpaths" = "[$.id, $.city, $.code]"
+      )       
   )
-  ````
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username"="user"
+  );
+  ```
+
+`jsonpaths` can also be used in conjunction with the column list and 
`SET (column_mapping)`:
+
+  ```sql
+  LOAD LABEL example_db.label10
+  (
+      DATA INFILE("hdfs://host:port/input/file.json")
+      INTO TABLE `my_table`
+      FORMAT AS "json"
+      (id, code, city)
+      SET (id = id * 10)
+      PROPERTIES(
+        "json_root" = "$.item",
+        "jsonpaths" = "[$.id, $.code, $.city]"
+      )       
+  )
+  with HDFS
+  (
+    "fs.defaultFS"="hdfs://host:port",
+    "hadoop.username"="user"
+  );
+  ```
+
+## S3 Load
+
+Doris supports importing data directly from object storage systems that 
support the S3 protocol. This section mainly introduces how to import data 
stored in AWS S3; importing from other S3-compatible object storage systems 
follows the same steps.
+
+### Preparation
+
+- AK and SK: First, you need to find or regenerate your AWS `Access Keys`. You 
can find instructions on how to generate them in the AWS console under `My 
Security Credentials`.
+
+- REGION and ENDPOINT: The REGION can be selected when creating a bucket or 
viewed in the bucket list. The S3 ENDPOINT for each REGION can be found in the 
[AWS 
documentation](https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region).
+
+### Import Example
+
+```sql
+    LOAD LABEL example_db.example_label_1
+    (
+        DATA INFILE("s3://your_bucket_name/your_file.txt")
+        INTO TABLE load_test
+        COLUMNS TERMINATED BY ","
+    )
+    WITH S3
+    (
+        "AWS_ENDPOINT" = "AWS_ENDPOINT",
+        "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+        "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+        "AWS_REGION" = "AWS_REGION"
+    )
+    PROPERTIES
+    (
+        "timeout" = "3600"
+    );
+```
+
+### Common Issues
+
+- The S3 SDK defaults to the virtual-hosted style for accessing objects. 
However, some object storage systems may not enable or support virtual-hosted 
style access. In such cases, you can add the `use_path_style` parameter to 
force path-style access:
+
+  ```Plain
+    WITH S3
+    (
+          "AWS_ENDPOINT" = "AWS_ENDPOINT",
+          "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
+          "AWS_SECRET_KEY"="AWS_SECRET_KEY",
+          "AWS_REGION" = "AWS_REGION",
+          "use_path_style" = "true"
+    )
+  ```
+
+- Temporary credentials (TOKEN) can be used to access any object storage 
system that supports the S3 protocol. The usage is as follows:
+
+  ```Plain
+    WITH S3
+    (
+          "AWS_ENDPOINT" = "AWS_ENDPOINT",
+          "AWS_ACCESS_KEY" = "AWS_TEMP_ACCESS_KEY",
+          "AWS_SECRET_KEY" = "AWS_TEMP_SECRET_KEY",
+          "AWS_TOKEN" = "AWS_TEMP_TOKEN",
+          "AWS_REGION" = "AWS_REGION"
+    )
+  ```
+
+## Importing Data Using Other Brokers
+
+The Broker for other remote storage systems is an optional process in the 
Doris cluster, primarily used to support Doris in reading and writing files and 
directories on remote storage. Currently, the following storage system Broker 
implementations are provided:
+
+- Alibaba Cloud OSS
+
+- Baidu Cloud BOS
+
+- Tencent Cloud CHDFS
+
+- Tencent Cloud GFS
+
+- Huawei Cloud OBS
+
+- JuiceFS
+
+- Google Cloud Storage (GCS)
+
+The Broker provides services through an RPC service port and operates as a 
stateless Java process. Its primary responsibility is to encapsulate POSIX-like 
file operations for remote storage, such as open, pread, pwrite, and more. 
Additionally, the Broker does not keep track of any other information, which 
means that all the connection details, file information, and permission details 
related to the remote storage must be passed to the Broker process through 
parameters during RPC calls. T [...]
+
+The Broker serves solely as a data pathway and does not involve any 
computational tasks, thus requiring minimal memory usage. Typically, a Doris 
system would deploy one or more Broker processes. Furthermore, Brokers of the 
same type are grouped together and assigned a unique name (Broker name).
+
+This section primarily focuses on the parameters required by the Broker when 
accessing different remote storage systems, such as connection information, 
authentication details, and more. Understanding and correctly configuring these 
parameters is crucial for successful and secure data exchange between Doris and 
the remote storage systems.
+
+### Broker Information
+
+The information of the Broker consists of two parts: the name (Broker name) 
and the authentication information. The usual syntax format is as follows:
+
+```Plain
+WITH BROKER "broker_name" 
+(
+    "username" = "xxx",
+    "password" = "yyy",
+    "other_prop" = "prop_value",
+    ...
+);
+```
+
+**Broker Name**
+
+Typically, users need to specify an existing Broker Name through the `WITH 
BROKER "broker_name"` clause in the operation command. The Broker Name is a 
name designated by the user when adding a Broker process through the `ALTER 
SYSTEM ADD BROKER` command. One name usually corresponds to one or more Broker 
processes. Doris will select an available Broker process based on the name. 
Users can view the Brokers that currently exist in the cluster through the 
`SHOW BROKER` command.
+
+:::info Note
+The Broker Name is merely a user-defined name and does not represent the type 
of Broker.
+:::
+
+**Authentication Information**
+
+Different Broker types and access methods require different authentication 
information, which is usually provided as Key-Value pairs in the property map 
after `WITH BROKER "broker_name"`.
+
+### Broker Examples
+
+- Alibaba Cloud OSS
+
+  ```Plain
+  (
+      "fs.oss.accessKeyId" = "",
+      "fs.oss.accessKeySecret" = "",
+      "fs.oss.endpoint" = ""
+  )
+  ```
+
+- JuiceFS
+
+  ```Plain
+  (
+      "fs.defaultFS" = "jfs://xxx/",
+      "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+      "fs.AbstractFileSystem.jfs.impl" = "io.juicefs.JuiceFS",
+      "juicefs.meta" = "xxx",
+      "juicefs.access-log" = "xxx"
+  )
+  ```
+
+- GCS
+
+  When using a Broker to access GCS, the Project ID is required, while other 
parameters are optional. Please refer to the [GCS 
Config](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/branch-2.2.x/gcs/CONFIGURATION.md)
 for all parameter configurations.
+
+  ```Plain
+  (
+      "fs.gs.project.id" = "Your Project ID",
+      "fs.AbstractFileSystem.gs.impl" = 
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
+      "fs.gs.impl" = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
+  )
+  ```
+
+## Related Configurations
+
+The following configurations are system-level settings for Broker Load and 
affect all Broker Load import tasks. They can be adjusted by modifying the 
`fe.conf` file.
+
+**min_bytes_per_broker_scanner**
+
+- Default: 64MB.
+
+- The minimum amount of data processed by a single BE in a Broker Load job.
+
+**max_bytes_per_broker_scanner**
+
+- Default: 500GB.
+
+- The maximum amount of data processed by a single BE in a Broker Load job.
+
+Typically, the maximum supported data volume for an import job is 
`max_bytes_per_broker_scanner * the number of BE nodes`. If you need to import 
a larger volume of data, you may need to adjust the size of the 
`max_bytes_per_broker_scanner` parameter appropriately.
+
+**max_broker_concurrency**
+
+- Default: 10.
+
+- Limits the maximum concurrency of imports for a job.
+
+- The minimum processed data volume, maximum concurrency, size of the source 
file, and the current number of BE nodes jointly determine the concurrency of 
this import.
+
+```Plain
+Import Concurrency = Math.min(Source File Size / Minimum Processing Amount, 
Maximum Concurrency, Current Number of BE Nodes)
+Processing Volume per BE for this Import = Source File Size / Import 
Concurrency
+```
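The formulas above can be expressed as a minimal sketch (this is an illustration of the documented rule, not Doris source code; `broker_load_concurrency` is a hypothetical helper name, and the defaults are the documented `min_bytes_per_broker_scanner` and `max_broker_concurrency` values):

```python
MIN_BYTES_PER_BROKER_SCANNER = 64 * 1024 ** 2   # 64MB default
MAX_BROKER_CONCURRENCY = 10                     # default

def broker_load_concurrency(source_file_bytes, be_nodes):
    """Return (import concurrency, bytes processed per BE for this import)."""
    # At least one scanner; otherwise bounded by the minimum processing amount.
    by_size = max(1, source_file_bytes // MIN_BYTES_PER_BROKER_SCANNER)
    concurrency = min(by_size, MAX_BROKER_CONCURRENCY, be_nodes)
    return concurrency, source_file_bytes / concurrency

# e.g. a 1GB file on a 3-BE cluster: 1GB / 64MB = 16, capped by the 3 BE nodes
conc, per_be = broker_load_concurrency(1024 ** 3, be_nodes=3)
print(conc)   # 3
```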
+
+## Common Issues
+
+**1. Import Error: `Scan bytes per broker scanner exceed limit:xxx`**
+
+Please refer to the Related Configurations section above and modify the FE 
configuration items `max_bytes_per_broker_scanner` and 
`max_broker_concurrency`.
+
+**2. Import Error: `failed to send batch` or `TabletWriter add batch with 
unknown id`**
+
+Appropriately adjust the `query_timeout` and 
`streaming_load_rpc_max_alive_time_sec` settings.
+
+**3. Import Error: `LOAD_RUN_FAIL; msg:Invalid Column Name:xxx`**
+
+For PARQUET or ORC format data, the column names in the file header must match 
the column names in the Doris table. For example:
+
+```Plain
+(tmp_c1,tmp_c2)
+SET
+(
+    id=tmp_c2,
+    name=tmp_c1
+)
+```
+
+This represents fetching the columns named (tmp_c1, tmp_c2) from the Parquet 
or ORC file and mapping them to the (id, name) columns in the Doris table. If 
no SET clause is specified, the columns in the file header are used for the 
mapping.
+
+:::info Note
+
+If ORC files are generated directly using certain Hive versions, the column 
headers in the ORC file may not be the Hive metadata, but (_col0, _col1, _col2, 
...), which may lead to the Invalid Column Name error. In this case, mapping 
using SET is necessary.
+:::
+
+**4. Import Error: `Failed to get S3 FileSystem for bucket is null/empty`**
+
+The bucket information is incorrect or the bucket does not exist, or the 
bucket name format is not supported. When a bucket name containing an 
underscore is created on GCS, such as `s3://gs_bucket/load_tbl`, the S3 Client 
may report an error when accessing GCS. It is recommended not to use 
underscores in bucket names.
+
+**5. Import Timeout**
+
+The default timeout for imports is 4 hours. If a timeout occurs, it is not 
recommended to simply increase the maximum import timeout. If a single import 
exceeds the default 4-hour timeout, it is best to split the file and run 
multiple smaller imports instead, because retrying a failed long-running 
import is very costly.
+
+You can calculate the expected maximum import file data volume for the Doris 
cluster using the following formula:
+
+Expected Maximum Import File Data Volume = 14400s * 10M/s * Number of BEs
+
+For example, if the cluster has 10 BEs:
 
-  Represents getting the column with (tmp_c1, tmp_c2) as the column name in 
parquet or orc, which is mapped to the (id, name) column in the doris table. If 
set is not set, the column in column is used as the map.
+Expected Maximum Import File Data Volume = 14400s * 10M/s * 10 = 1440000M ≈ 
1440G
 
-  Note: If you use the orc file directly generated by some hive versions, the 
header in the orc file is not hive meta data, but (_col0, _col1, _col2, ...), 
which may cause Invalid Column Name error, then you need to use set to map
+:::info Note
 
-- Import error: `Failed to get S3 FileSystem for bucket is null/empty`
-  1. The bucket is incorrect or does not exist.
-  2. The bucket format is not supported. When creating a bucket name with `_` 
on GCS, like `s3://gs_bucket/load_tbl`, the S3 Client will report an error. It 
is recommended not to use `_` on GCS.
+In general, user environments may not reach speeds of 10M/s, so it is 
recommended to split files exceeding 500G before importing.
+:::
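The arithmetic above can be checked directly (plain arithmetic, not a Doris API; `expected_max_import_mb` is a hypothetical helper name assuming the 4-hour default timeout and the conservative 10 MB/s per-BE rate stated above):

```python
def expected_max_import_mb(be_nodes, timeout_s=14400, mb_per_s=10):
    """Expected maximum import file data volume, in MB."""
    return timeout_s * mb_per_s * be_nodes

# For a 10-BE cluster: 14400s * 10M/s * 10 = 1440000M, i.e. about 1440G
print(expected_max_import_mb(10))   # 1440000
```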
 
-## more help
+## More Help
 
-For more detailed syntax and best practices used by Broker Load, see [Broker 
Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD)
 command manual, you can also enter `HELP BROKER LOAD` in the MySql client 
command line for more help information.
+For more detailed syntax and best practices for using [Broker 
Load](../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD), 
please refer to the Broker Load command manual. You can also enter `HELP 
BROKER LOAD` in the MySQL client command line to obtain more help information.
\ No newline at end of file

