This is an automated email from the ASF dual-hosted git repository.
dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new dd2f78db47 fix load overview (#1012)
dd2f78db47 is described below
commit dd2f78db4741e72d72c4a9f9cc6902c6f56f70f3
Author: Yongqiang YANG <[email protected]>
AuthorDate: Tue Aug 20 15:08:54 2024 +0800
fix load overview (#1012)
---
.../import/import-way/insert-into-manual.md | 6 ++-
docs/data-operate/import/load-manual.md | 53 +++++++++++++---------
.../import/import-way/insert-into-manual.md | 7 ++-
.../current/data-operate/import/load-manual.md | 45 ++++++++++--------
4 files changed, 70 insertions(+), 41 deletions(-)
diff --git a/docs/data-operate/import/import-way/insert-into-manual.md
b/docs/data-operate/import/import-way/insert-into-manual.md
index d9d7984e5b..70580e3cef 100644
--- a/docs/data-operate/import/import-way/insert-into-manual.md
+++ b/docs/data-operate/import/import-way/insert-into-manual.md
@@ -79,7 +79,7 @@ VALUES (1, "Emily", 25),
(5, "Ava", 17);
```
-INSERT INTO is a synchronous import method, where the import result is
directly returned to the user.
+INSERT INTO is a synchronous import method, where the import result is
directly returned to the user. You can enable [group
commit](../import-way/group-commit-manual.md) to achieve high performance.
```JSON
Query OK, 5 rows affected (0.308 sec)
@@ -127,6 +127,10 @@ MySQL> SELECT COUNT(*) FROM testdb.test_table2;
1 row in set (0.071 sec)
```
+4. You can use [JOB](../../scheduler/job-scheduler.md) to make the INSERT
operation execute asynchronously.
+
+5. Sources can be [table-valued functions (TVFs)](../../../lakehouse/file.md)
or tables in a [catalog](../../../lakehouse/database).
+
### View INSERT INTO jobs
You can use the `SHOW LOAD` command to view the completed INSERT INTO tasks.
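The synchronous INSERT INTO path the added lines describe (small batches over JDBC, results returned directly) can be sketched roughly as follows. This is a minimal illustration, not code from the commit: the table and column names are hypothetical, and a real client would execute the statement through a DB-API/JDBC connection to Doris's MySQL protocol.

```python
# Minimal sketch of a batched INSERT INTO, assuming a DB-API-style
# connection to Doris (e.g. via its MySQL protocol). Table and column
# names are illustrative only.
def build_batch_insert(table, rows):
    """Build one multi-row INSERT statement plus its flat parameter list."""
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = f"INSERT INTO {table} (id, name, age) VALUES {placeholders}"
    params = [value for row in rows for value in row]
    return sql, params

rows = [(1, "Emily", 25), (2, "Benjamin", 35), (5, "Ava", 17)]
sql, params = build_batch_insert("testdb.test_table", rows)
# A client would then run: cursor.execute(sql, params)
```

Batching rows into one statement keeps the number of implicit transactions low, which is also why the docs steer high-frequency writers toward Group Commit.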
diff --git a/docs/data-operate/import/load-manual.md
b/docs/data-operate/import/load-manual.md
index 188c4b928d..6dcf57e666 100644
--- a/docs/data-operate/import/load-manual.md
+++ b/docs/data-operate/import/load-manual.md
@@ -26,24 +26,35 @@ under the License.
Apache Doris offers various methods for importing and integrating data,
allowing you to import data from diverse sources into the database. These
methods can be categorized into four types:
-1. **Real-Time Writing**: Data is written into Doris tables in real-time via
HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
- - For small amounts of data (once every 5 minutes), use [JDBC
INSERT](./import-way/insert-into-manual.md).
- - For higher concurrency or frequency (more than 20 concurrent writes or
multiple writes per minute), enable [Group
Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
- - For high throughput, use [Stream Load](./import-way/stream-load-manua)
via HTTP.
+- **Real-Time Writing**: Data is written into Doris tables in real-time via
HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
-2. **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka,
transactional databases) are imported into Doris tables, ideal for real-time
analysis and querying.
- - Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to
write Flink’s real-time data streams into Doris.
- - Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka
Connector](../../ecosystem/doris-kafka-connector.md) for Kafka’s real-time data
streams. Routine Load pulls data from Kafka to Doris and supports CSV and JSON
formats, while Kafka Connector writes data to Doris, supporting Avro, JSON,
CSV, and Protobuf formats.
- - Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or
[Datax](../../ecosystem/datax.md) to write transactional database CDC data
streams into Doris.
+ - For small amounts of data (once every 5 minutes), you can use [JDBC
INSERT](./import-way/insert-into-manual.md).
-3. **Batch Import**: Data is batch-loaded from external storage systems (e.g.,
S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data
import needs.
- - Use [Broker Load](./import-way/broker-load-manual.md) to write files
from S3 and HDFS into Doris.
- - Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to
synchronize files from S3, HDFS, and NAS into Doris, with asynchronous writing
via [JOB](../scheduler/job-scheduler.md).
- - Use [Stream Load](./import-way/stream-load-manua) or [Doris
Streamloader](../../ecosystem/doris-streamloader.md) to write local files into
Doris.
+ - For higher concurrency or frequency (more than 20 concurrent writes or
multiple writes per minute), you can enable [Group
Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
-4. **External Data Source Integration**: Query and partially import data from
external sources (e.g., Hive, JDBC, Iceberg) into Doris tables.
- - Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data
from external sources and use [INSERT INTO
SELECT](./import-way/insert-into-manual.md) to synchronize this data into
Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
- - Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from
other AP systems into Doris.
+ - For high throughput, you can use [Stream
Load](./import-way/stream-load-manual) via HTTP.
+
+- **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka,
transactional databases) are imported into Doris tables, ideal for real-time
analysis and querying.
+
+ - You can use [Flink Doris
Connector](../../ecosystem/flink-doris-connector.md) to write Flink’s real-time
data streams into Doris.
+
+ - You can use [Routine Load](./import-way/routine-load-manual.md) or
[Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) for Kafka’s
real-time data streams. Routine Load pulls data from Kafka to Doris and
supports CSV and JSON formats, while Kafka Connector writes data to Doris,
supporting Avro, JSON, CSV, and Protobuf formats.
+
+ - You can use [Flink CDC](../../ecosystem/flink-doris-connector.md) or
[Datax](../../ecosystem/datax.md) to write transactional database CDC data
streams into Doris.
+
+- **Batch Import**: Data is batch-loaded from external storage systems (e.g.,
S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data
import needs.
+
+ - You can use [Broker Load](./import-way/broker-load-manual.md) to write
files from S3 and HDFS into Doris.
+
+ - You can use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to
synchronously load files from S3, HDFS, and NAS into Doris, and you can perform
the operation asynchronously using a [JOB](../scheduler/job-scheduler.md).
+
+ - You can use [Stream Load](./import-way/stream-load-manual) or [Doris
Streamloader](../../ecosystem/doris-streamloader.md) to write local files into
Doris.
+
+- **External Data Source Integration**: Query and partially import data from
external sources (e.g., Hive, JDBC, Iceberg) into Doris tables.
+
+ - You can create a [Catalog](../../lakehouse/lakehouse-overview.md) to
read data from external sources and use [INSERT INTO
SELECT](./import-way/insert-into-manual.md) to synchronize this data into
Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+
+ - You can use [X2Doris](./migrate-data-from-other-olap.md) to migrate data
from other AP systems into Doris.
Each import method in Doris is an implicit transaction by default. For more
information on transactions, refer to [Transactions](../transaction.md).
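The Stream Load path named in the real-time-writing and batch-import bullets above is an HTTP PUT against the FE. A hedged sketch of how a client might assemble that request, without sending it: host, port, database, table, and credentials below are placeholders, and header details should be checked against the Stream Load manual for your Doris version.

```python
import base64
import urllib.request

# Sketch of a Stream Load request: PUT the payload to the FE's
# /api/{db}/{table}/_stream_load endpoint with HTTP basic auth.
# All connection details here are placeholder values.
def build_stream_load_request(host, port, db, table, user, password, data):
    url = f"http://{host}:{port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Expect", "100-continue")  # expected by Stream Load
    req.add_header("format", "csv")           # declares the payload format
    return req

req = build_stream_load_request("127.0.0.1", 8030, "testdb", "test_table",
                                "root", "", b"1,Emily,25\n")
# urllib.request.urlopen(req) would submit the load synchronously and
# return a JSON result describing the load status.
```

Because Stream Load is synchronous, the HTTP response itself tells the client whether the batch committed, which fits the "immediate analysis" scenarios described above.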
@@ -53,10 +64,10 @@ Doris's import process mainly involves various aspects such
as data sources, dat
| Import Method | Use Case
| Supported File Formats | Single Import Volume | Import
Mode |
| :-------------------------------------------- |
:----------------------------------------- | ----------------------- |
----------------- | -------- |
-| [Stream Load](./import-way/stream-load-manual) | Import from local
data | csv, json, parquet, orc | Less than 10GB
| Synchronous |
-| [Broker Load](./import-way/broker-load-manual.md) | Import from
object storage, HDFS, etc. | csv, json, parquet, orc | Tens
of GB to hundreds of GB | Asynchronous |
-| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | <p>Import single
or small batch data</p><p>Import via JDBC, etc.</p> | SQL |
Simple testing | Synchronous |
-| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | <p>Import data
between Doris internal tables</p><p>Import external tables</p> | SQL
| Depending on memory size | Synchronous |
+| [Stream Load](./import-way/stream-load-manual) | Importing local
files or pushing data from applications via HTTP. | csv,
json, parquet, orc | Less than 10GB | Synchronous |
+| [Broker Load](./import-way/broker-load-manual.md) | Importing from
object storage, HDFS, etc. | csv, json, parquet, orc | Tens
of GB to hundreds of GB | Asynchronous |
+| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | Writing data via
JDBC. | SQL | Simple testing | Synchronous |
+| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | Importing from an
external source like a table in a catalog or files in S3. | SQL
| Depending on memory size | Synchronous, asynchronous via JOB |
| [Routine Load](./import-way/routine-load-manual.md) | Real-time import
from Kafka | csv, json | Micro-batch
import MB to GB | Asynchronous |
-| [MySQL Load](./import-way/mysql-load-manual.md) | Import from local
data | csv | Less than 10GB
| Synchronous |
-| [Group Commit](./import-way/group-commit-manual.md) |
High-frequency small batch import | Depending on
the import method used | Micro-batch import KB | -
|
\ No newline at end of file
+| [MySQL Load](./import-way/mysql-load-manual.md) | Importing from
local files. | csv | Less than
1GB | Synchronous |
+| [Group Commit](./import-way/group-commit-manual.md) | Writing with
high frequency. | Depending on the import method
used | Micro-batch import KB | - |
\ No newline at end of file
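The table's "Asynchronous via JOB" mode wraps an INSERT INTO SELECT in a scheduled JOB statement. A hedged sketch of assembling such a statement: the job name, interval, and source table are made up for illustration, and the exact CREATE JOB grammar should be verified against the job-scheduler docs for your Doris version.

```python
# Sketch of wrapping an INSERT INTO SELECT in a scheduled JOB so it
# runs asynchronously. The JOB grammar here follows the job-scheduler
# docs in spirit; verify it against your Doris version before use.
def build_insert_job(job_name, interval, insert_sql):
    return (f"CREATE JOB {job_name} "
            f"ON SCHEDULE EVERY {interval} "
            f"DO {insert_sql}")

stmt = build_insert_job(
    "s3_sync_job", "1 DAY",
    "INSERT INTO testdb.test_table SELECT * FROM s3_source")
# Executing stmt registers the job; Doris then runs the INSERT on
# schedule instead of blocking the submitting session.
```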
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
index 198c0be954..0bdf6b570b 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
@@ -84,7 +84,7 @@ VALUES (1, "Emily", 25),
(5, "Ava", 17);
```
-INSERT INTO 是一种同步导入方式,导入结果会直接返回给用户。
+INSERT INTO 是一种同步导入方式,导入结果会直接返回给用户。可以打开 [group
commit](../import-way/group-commit-manual.md) 达到更高的性能。
```JSON
Query OK, 5 rows affected (0.308 sec)
@@ -132,6 +132,10 @@ MySQL> SELECT COUNT(*) FROM testdb.test_table2;
1 row in set (0.071 sec)
```
+4. 可以使用 [JOB](../../scheduler/job-scheduler.md) 异步执行 INSERT。
+
+5. 数据源可以是 [tvf](../../../lakehouse/file.md) 或者
[catalog](../../../lakehouse/database) 中的表。
+
### 查看导入作业
可以通过 show load 命令查看已完成的 INSERT INTO 任务。
@@ -382,6 +386,7 @@ INSERT INTO target_tbl SELECT k1,k2,k3 FROM
hive.db1.source_tbl limit 100;
INSERT 命令是同步命令,返回成功,即表示导入成功。
+
### 注意事项
- 必须保证外部数据源与 Doris 集群是可以互通,包括 BE 节点和外部数据源的网络是互通的。
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
index e0337368e3..7e056884b3 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
@@ -26,24 +26,33 @@ under the License.
Apache Doris 提供了多种导入和集成数据的方法,您可以使用合适的导入方式从各种源将数据导入到数据库中。Apache Doris
提供的数据导入方式可以分为四类:
-1. **实时写入**:应用程序通过 HTTP 或者 JDBC 实时写入数据到 Doris 表中,适用于需要实时分析和查询的场景。
- * 极少量数据(5 分钟一次)时可以使用 [JDBC INSERT](./import-way/insert-into-manual.md)
写入数据。
- * 并发较高或者频次较高(大于 20 并发或者 1 分钟写入多次)时建议打开 [Group
Commit](./import-way/group-commit-manual.md),使用 JDBC INSERT 或者 Stream Load 写入数据。
- * 吞吐较高时推荐使用 [Stream Load](./import-way/stream-load-manua) 通过 HTTP 写入数据。
+- **实时写入**:应用程序通过 HTTP 或者 JDBC 实时写入数据到 Doris 表中,适用于需要实时分析和查询的场景。
-2. **流式同步**:通过实时数据流(如 Flink、Kafka、事务数据库)将数据实时导入到 Doris 表中,适用于需要实时分析和查询的场景。
- * 可以使用 [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) 将
Flink 的实时数据流写入到 Doris 表中。
- * 可以使用 [Routine Load](./import-way/routine-load-manual.md) 或者 [Doris Kafka
Connector](../../ecosystem/doris-kafka-connector.md) 将 Kafka 的实时数据流写入到 Doris
表中。Routine Load 方式下,Doris 会调度任务将 Kafka 中的数据拉取并写入 Doris 中,目前支持 csv 和 json
格式的数据。Kafka Connector 方式下,由 Kafka 将数据写入到 Doris 中,支持 avro、json、csv、protobuf
格式的数据。
- * 可以使用 [Flink CDC](../../ecosystem/flink-doris-connector.md) 或 [
Datax](../../ecosystem/datax.md) 将事务数据库的 CDC 数据流写入到 Doris 中。
+ - 极少量数据(5 分钟一次)时可以使用 [JDBC INSERT](./import-way/insert-into-manual.md)
写入数据。
-3. **批量导入**:将数据从外部存储系统(如 S3、HDFS、本地文件、NAS)批量加载到 Doris 表中,适用于非实时数据导入的需求。
- * 可以使用 [Broker Load](./import-way/broker-load-manual.md) 将 S3 和 HDFS
中的文件写入到 Doris 中。
- * 可以使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将 S3、HDFS
和 NAS 中的文件同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
- * 可以使用 [Stream Load](./import-way/stream-load-manua) 或者 [Doris
Streamloader](../../ecosystem/doris-streamloader.md) 将本地文件写入 Doris 中。
+ - 并发较高或者频次较高(大于 20 并发或者 1 分钟写入多次)时建议打开 [Group
Commit](./import-way/group-commit-manual.md),使用 JDBC INSERT 或者 Stream Load 写入数据。
-4. **外部数据源集成**:通过与外部数据源(如 Hive、JDBC、Iceberg 等)的集成,实现对外部数据的查询和部分数据导入到 Doris 表中。
- * 可以创建 [Catalog](../../lakehouse/lakehouse-overview.md) 读取外部数据源中的数据,使用
[INSERT INTO SELECT](./import-way/insert-into-manual.md) 将外部数据源中的数据同步写入到 Doris
中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
- * 可以使用 [X2Doris](./migrate-data-from-other-olap.md) 将其他 AP 系统的数据迁移到 Doris
中。
+ - 吞吐较高时推荐使用 [Stream Load](./import-way/stream-load-manual) 通过 HTTP 写入数据。
+
+- **流式同步**:通过实时数据流(如 Flink、Kafka、事务数据库)将数据实时导入到 Doris 表中,适用于需要实时分析和查询的场景。
+
+ - 可以使用 [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) 将
Flink 的实时数据流写入到 Doris 表中。
+
+ - 可以使用 [Routine Load](./import-way/routine-load-manual.md) 或者 [Doris Kafka
Connector](../../ecosystem/doris-kafka-connector.md) 将 Kafka 的实时数据流写入到 Doris
表中。Routine Load 方式下,Doris 会调度任务将 Kafka 中的数据拉取并写入 Doris 中,目前支持 csv 和 json
格式的数据。Kafka Connector 方式下,由 Kafka 将数据写入到 Doris 中,支持 avro、json、csv、protobuf
格式的数据。
+
+ - 可以使用 [Flink CDC](../../ecosystem/flink-doris-connector.md) 或 [
Datax](../../ecosystem/datax.md) 将事务数据库的 CDC 数据流写入到 Doris 中。
+
+- **批量导入**:将数据从外部存储系统(如 S3、HDFS、本地文件、NAS)批量加载到 Doris 表中,适用于非实时数据导入的需求。
+ - 可以使用 [Broker Load](./import-way/broker-load-manual.md) 将 S3 和 HDFS
中的文件写入到 Doris 中。
+
+ - 可以使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将 S3、HDFS
和 NAS 中的文件同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
+
+ - 可以使用 [Stream Load](./import-way/stream-load-manual) 或者 [Doris
Streamloader](../../ecosystem/doris-streamloader.md) 将本地文件写入 Doris 中。
+
+- **外部数据源集成**:通过与外部数据源(如 Hive、JDBC、Iceberg 等)的集成,实现对外部数据的查询和部分数据导入到 Doris 表中。
+ - 可以创建 [Catalog](../../lakehouse/lakehouse-overview.md) 读取外部数据源中的数据,使用
[INSERT INTO SELECT](./import-way/insert-into-manual.md) 将外部数据源中的数据同步写入到 Doris
中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
+
+ - 可以使用 [X2Doris](./migrate-data-from-other-olap.md) 将其他 AP 系统的数据迁移到 Doris
中。
Doris 的每个导入默认都是一个隐式事务,事务相关的更多信息请参考[事务](../transaction.md)。
@@ -53,10 +62,10 @@ Doris 的导入主要涉及数据源、数据格式、导入方式、错误数
| 导入方式 | 使用场景
| 支持的文件格式 | 单次导入数据量 | 导入模式 |
| :-------------------------------------------- |
:----------------------------------------- | ----------------------- |
----------------- | -------- |
-| [Stream Load](./import-way/stream-load-manual) | 从本地数据导入
| csv、json、parquet、orc | 小于10GB | 同步 |
+| [Stream Load](./import-way/stream-load-manual) | 导入本地文件或者应用程序写入
| csv、json、parquet、orc | 小于10GB | 同步 |
| [Broker Load](./import-way/broker-load-manual.md) | 从对象存储、HDFS等导入
| csv、json、parquet、orc | 数十GB到数百 GB | 异步 |
-| [INSERT INTO VALUES](./import-way/insert-into-manual.md) |
<p>单条或小批量据量导入</p><p>通过JDBC等接口导入</p> | SQL | 简单测试用 |
同步 |
-| [INSERT INTO SELECT](./import-way/insert-into-manual.md) |
<p>Doris内部表之间数据导入</p><p>外部表导入</p> | SQL | 根据内存大小而定 |
同步 |
+| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | 通过JDBC等接口导入 | SQL
| 简单测试用 | 同步 |
+| [INSERT INTO SELECT](./import-way/insert-into-manual.md) |
可以导入外部表或者对象存储、HDFS中的文件 | SQL | 根据内存大小而定 | 同步 |
| [Routine Load](./import-way/routine-load-manual.md) | 从 Kafka 实时导入
| csv、json | 微批导入 MB 到 GB | 异步 |
| [MySQL Load](./import-way/mysql-load-manual.md) | 从本地数据导入
| csv | 小于10GB | 同步 |
| [Group Commit](./import-way/group-commit-manual.md) | 高频小批量导入
| 根据使用的导入方式而定 | 微批导入KB | -
|
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]