This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new f8f9acde61d [doc](load) add load internals doc (#2799)
f8f9acde61d is described below

commit f8f9acde61dd44da57b14c1d93aebcd69d2bf8b0
Author: Xin Liao <[email protected]>
AuthorDate: Thu Aug 28 20:22:46 2025 +0800

    [doc](load) add load internals doc (#2799)
    
    ## Versions
    
    - [x] dev
    - [ ] 3.0
    - [ ] 2.1
    - [ ] 2.0
    
    ## Languages
    
    - [ ] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 .../import/load-internals/load-internals.md        | 213 +++++++++++++++++++++
 .../docusaurus-plugin-content-docs/current.json    |   4 +
 .../import/load-internals/load-internas.md         | 211 ++++++++++++++++++++
 sidebars.json                                      |   9 +-
 4 files changed, 436 insertions(+), 1 deletion(-)

diff --git a/docs/data-operate/import/load-internals/load-internals.md b/docs/data-operate/import/load-internals/load-internals.md
new file mode 100644
index 00000000000..4f2b77420f3
--- /dev/null
+++ b/docs/data-operate/import/load-internals/load-internals.md
@@ -0,0 +1,213 @@
+---
+{
+    "title": "Load Internals and Performance Optimization",
+    "language": "en"
+}
+---
+
+## Overview
+
+Apache Doris is a high-performance distributed analytical database that adopts 
the MPP (Massively Parallel Processing) architecture and is widely used in 
real-time data analysis, data warehousing, and stream computing scenarios. Data 
loading is a core functionality of Doris that directly affects the real-time 
nature and accuracy of data analysis. An efficient loading mechanism ensures 
that large-scale data can enter the system quickly and reliably, providing 
support for subsequent querie [...]
+
+## Data Load Internals
+
+### Load Internals Overview
+
+Doris's data loading internals are built on its distributed architecture, 
mainly involving Frontend nodes (FE) and Backend nodes (BE). FE is responsible 
for metadata management, query parsing, task scheduling, and transaction 
coordination, while BE handles actual data storage, computation, and write 
operations. Doris's data loading design aims to meet diverse business needs, 
including real-time writing, streaming synchronization, batch loading, and 
external data source integration. Its c [...]
+
+- **Consistency and Atomicity**: Each load task acts as a transaction, 
ensuring atomic data writes and avoiding partial writes. The Label mechanism 
guarantees that loaded data is neither lost nor duplicated.
+- **Flexibility**: Supports multiple data sources (such as local files, HDFS, 
S3, Kafka, etc.) and formats (such as CSV, JSON, Parquet, ORC, etc.) to meet 
different scenarios.
+- **Efficiency**: Leverages distributed architecture for parallel data 
processing, with multiple BE nodes processing data in parallel to improve 
throughput.
+- **Simplicity**: Provides lightweight ETL functionality, allowing users to 
perform data cleaning and transformation directly during loading, reducing 
dependency on external tools.
+- **Flexible Modeling**: Supports detail models (Duplicate Key), primary key 
models (Unique Key), and aggregate models (Aggregate Key), allowing data 
aggregation or deduplication during loading.
+
+### General Load Process
+
+Doris's data loading process can be divided into several intuitive steps. 
Regardless of the loading method used (such as Stream Load, Broker Load, 
Routine Load, etc.), the core process is basically consistent.
+
+1. **Submit Load Task**
+   1. Users submit load requests through clients (such as HTTP, JDBC, MySQL 
client), specifying data sources (such as local files, Kafka Topics, HDFS file 
paths), target tables, file formats, and load parameters (such as delimiters, 
error tolerance).
+   2. Each task can specify a unique **Label** for task identification and 
idempotency support (preventing duplicate loads). For example, users specify 
Labels through HTTP headers in Stream Load.
+   3. Doris's Frontend node (FE) receives the request, validates permissions, 
checks if the target table exists, and parses load parameters.
+
+2. **Task Assignment and Coordination**
+   1. FE analyzes data distribution (based on table partitioning and bucketing 
rules), generates a load plan, and selects a Backend node (BE) as the 
**Coordinator** to coordinate the entire task.
+   2. If users submit directly to BE (such as Stream Load), BE can directly 
serve as Coordinator but still needs to obtain metadata (such as table Schema) 
from FE.
+   3. The load plan distributes data to multiple BE nodes, ensuring parallel 
processing to improve efficiency.
+
+3. **Data Reading and Distribution**
+   1. Coordinator BE reads data from data sources (for example, pulling 
messages from Kafka, reading files from S3, or directly receiving HTTP data 
streams).
+   2. Doris parses data formats (such as CSV splitting, JSON parsing) and 
supports user-defined **lightweight ETL** operations, including:
+      - **Pre-filtering**: Filters raw data to reduce processing overhead.
+      - **Column mapping**: Adjusts the correspondence between data columns 
and target table columns.
+      - **Data transformation**: Processes data through expressions.
+      - **Post-filtering**: Filters transformed data.
+   3. After parsing data, Coordinator BE distributes it to multiple downstream 
Executor BEs according to partitioning and bucketing rules.
+
+4. **Data Writing**
+   1. Data is distributed to multiple BE nodes and written to memory tables 
(MemTable), sorted by Key columns. For Aggregate or Unique Key models, Doris 
performs aggregation or deduplication according to Keys (such as SUM, REPLACE).
+   2. When MemTable is full (default 200MB) or the task ends, data is 
asynchronously written to disk, forming columnar storage **Segment files** and 
composing **Rowsets**.
+   3. Each BE independently processes assigned data and reports status to the 
Coordinator after writing is complete.
+
+5. **Transaction Commit and Publishing**
+   1. The Coordinator initiates a transaction commit (Commit) to FE. After FE confirms that a majority of replicas have been written successfully, it notifies the BEs to publish the data version (Publish Version). Once BE Publish succeeds, FE marks the transaction as **VISIBLE**, at which point the data becomes queryable.
+   2. If the commit fails, FE triggers a rollback (Rollback) and deletes the temporary data to ensure consistency.
+
+6. **Result Return**
+   1. Synchronous methods (such as Stream Load, Insert Into) directly return 
load results, including success/failure status and error details (such as 
ErrorURL).
+   2. Asynchronous methods (such as Broker Load) return a task ID and Label; users can check progress, error row counts, and detailed information through SHOW LOAD (see the sketch after this list).
+   3. Operations are recorded in audit logs for subsequent tracing.
+
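+As a concrete illustration of the flow above, the sketch below submits a Broker Load job with an explicit Label and then checks its status with SHOW LOAD. The database, table, file path, and S3 credentials are placeholders, and the exact storage property names can differ between Doris versions.
+
+```sql
+-- Submit an asynchronous Broker Load with an explicit Label
+-- (placeholder database/table/path; adjust storage properties to your setup).
+LOAD LABEL demo_db.orders_20250828
+(
+    DATA INFILE("s3://my-bucket/orders/2025-08-28/*.csv")
+    INTO TABLE orders
+    COLUMNS TERMINATED BY ","
+)
+WITH S3
+(
+    "AWS_ENDPOINT" = "s3.ap-southeast-1.amazonaws.com",
+    "AWS_REGION" = "ap-southeast-1",
+    "AWS_ACCESS_KEY" = "<access_key>",
+    "AWS_SECRET_KEY" = "<secret_key>"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+-- The Label makes the job idempotent and lets you track it later.
+SHOW LOAD WHERE LABEL = "orders_20250828";
+```
+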
+### Memtable Forwarding
+
+Memtable forwarding is an optimization mechanism introduced in Apache Doris 2.1.0 that significantly improves performance for INSERT INTO ... SELECT loads. Official tests show load time reduced to 36% of the original in single-replica scenarios and to 54% in three-replica scenarios, an overall performance improvement of more than 100%. In the traditional process, Sink nodes must encode data into Block format and transmit it to downstream nodes through ping-pong RPC, involving multiple encoding and [...]
+
+### Separation of Storage and Compute Load
+
+In storage-compute separation architecture, load optimization focuses on 
decoupling data storage and transaction management:
+
+- **Data Storage**: BE does not persist data. After MemTable flush, Segment 
files are directly uploaded to shared storage (such as S3, HDFS), leveraging 
object storage's high availability and low cost to support elastic scaling. BE 
local File Cache asynchronously caches hot data, improving query hit rates 
through TTL and Warmup strategies. Metadata (such as Tablet, Rowset metadata) 
is stored by Meta Service in FoundationDB rather than BE local RocksDB.
+- **Transaction Processing**: Transaction management migrates from FE to Meta 
Service, eliminating FE Edit Log write bottlenecks. Meta Service manages 
transactions through standard interfaces (beginTransaction, commitTransaction), 
relying on FoundationDB's global transaction capabilities to ensure 
consistency. BE coordinators directly interact with Meta Service, recording 
transaction states and handling conflicts and timeout recovery through atomic 
operations, simplifying synchronization [...]
+
+### Load Methods
+
+Doris provides multiple load methods that share the above principles but are 
optimized for different scenarios. Users can choose based on data sources and 
business needs:
+
+- **Stream Load**: Load local files or data streams through HTTP, returning 
results synchronously, suitable for real-time writing (such as application data 
pushing).
+- **Broker Load**: Load HDFS, S3, and other external storage through SQL, 
executing asynchronously, suitable for large-scale batch loads.
+- **Routine Load**: Continuously consumes data from Kafka as an asynchronous streaming load with Exactly-Once semantics, suitable for real-time synchronization of message queue data (see the sketch after this list).
+- **Insert Into/Select**: Load from Doris tables or external sources (such as 
Hive, MySQL, S3 TVF) through SQL, suitable for ETL jobs and external data 
integration.
+- **MySQL Load**: Compatible with MySQL LOAD DATA syntax, loads local CSV 
files with data forwarded through FE as Stream Load, suitable for small-scale 
testing or MySQL user migration.
+
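+The sketch below shows what a Routine Load job from the list above might look like. The database, table, topic, and broker addresses are placeholders, and the batch properties are illustrative values rather than recommendations.
+
+```sql
+-- Continuously consume a Kafka topic into a Doris table
+-- (placeholder names; tune batch properties for your workload).
+CREATE ROUTINE LOAD demo_db.ingest_events ON user_event_log
+COLUMNS TERMINATED BY ","
+PROPERTIES
+(
+    "max_batch_interval" = "20",
+    "max_batch_rows" = "300000",
+    "max_error_number" = "1000"
+)
+FROM KAFKA
+(
+    "kafka_broker_list" = "broker1:9092,broker2:9092",
+    "kafka_topic" = "user_events",
+    "property.group.id" = "doris_routine_load_demo"
+);
+
+-- Check the job status and consumption progress.
+SHOW ROUTINE LOAD FOR demo_db.ingest_events;
+```
+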
+## How to Improve Doris Load Performance
+
+Doris's load performance is affected by its distributed architecture and 
storage mechanisms, with core aspects involving FE metadata management, BE 
parallel processing, MemTable cache flushing, and transaction management. The 
following optimization strategies and their effectiveness are explained from 
the dimensions of table structure design, batching strategies, bucket 
configuration, memory management, and concurrency control, combined with load 
principles.
+
+### **Table Structure Design Optimization: Reduce Distribution Overhead and 
Memory Pressure**
+
+In Doris's load process, data needs to be parsed by FE and then distributed to 
Tablets (data shards) on BE nodes according to table partitioning and bucketing 
rules, cached and sorted in BE memory through MemTable, and then flushed to 
disk to generate Segment files. Table structure (partitioning, models, indexes) 
directly affects data distribution efficiency, computational load, and storage 
fragmentation.
+
+- **Partition Design: Isolate Data Ranges, Reduce Distribution and Memory 
Pressure**
+
+By partitioning according to business query patterns (such as time or region), data is distributed only to the target partitions during loading, avoiding metadata and file handling for unrelated partitions. Writing to many partitions simultaneously keeps many Tablets active, each occupying an independent MemTable, which significantly increases BE memory pressure and can trigger early flushes that generate numerous small Segment files. This not only increases disk or o [...]
+
+- **Model Selection: Reduce Computational Load, Accelerate Writing**
+
+Detail models (Duplicate Key) store raw data without aggregation or deduplication, whereas Aggregate models must aggregate by Key columns and Unique Key models must deduplicate, both of which increase CPU and memory consumption. For scenarios that do not require deduplication or aggregation, preferring the detail model avoids additional computation (such as sorting and deduplication) at the MemTable stage on BE nodes, reducing memory usage and CPU pressure and accelerating the data wri [...]
+
+- **Index Control: Balance Query and Write Overhead**
+
+Indexes (such as bitmap indexes, inverted indexes) need synchronous updates 
during loading, increasing maintenance costs during writing. Creating indexes 
only for high-frequency query fields and avoiding redundant indexes can reduce 
index update operations (such as index building, verification) during BE 
writing, reducing CPU and memory usage and improving load throughput.
+
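+The following sketch pulls the three points above together in one table definition. The table name, columns, and partition retention window are hypothetical, and the inverted index is included only to show that indexes should be limited to columns that are actually queried.
+
+```sql
+-- Detail (Duplicate Key) model, time-based partitions, and a single
+-- index on a frequently filtered column (all names are placeholders).
+CREATE TABLE demo_db.user_event_log
+(
+    event_time  DATETIME     NOT NULL,
+    user_id     BIGINT       NOT NULL,
+    event_type  VARCHAR(64),
+    detail      STRING,
+    INDEX idx_event_type (event_type) USING INVERTED
+)
+DUPLICATE KEY(event_time, user_id)
+PARTITION BY RANGE(event_time) ()
+DISTRIBUTED BY HASH(user_id) BUCKETS 16
+PROPERTIES
+(
+    "dynamic_partition.enable"    = "true",
+    "dynamic_partition.time_unit" = "DAY",
+    "dynamic_partition.start"     = "-30",
+    "dynamic_partition.end"       = "3",
+    "dynamic_partition.prefix"    = "p"
+);
+```
+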
+### **Batching Optimization: Reduce Transactions and Storage Fragmentation**
+
+Each load task in Doris is an independent transaction, involving FE Edit Log 
writing (recording metadata changes) and BE MemTable flushing (generating 
Segment files). High-frequency small-batch loads (such as KB-level) cause 
frequent Edit Log writing (increasing FE disk I/O) and frequent MemTable 
flushing (generating numerous small Segment files, triggering Compaction write 
amplification), significantly degrading performance.
+
+- **Client-side Batching: Reduce Transaction Count, Lower Metadata Overhead**
+
+Clients accumulate hundreds of MB to several GB of data before loading it in a single batch, reducing the number of transactions. Replacing many small transactions with one large transaction lowers the FE Edit Log write frequency (fewer metadata operations) and the BE MemTable flush frequency (fewer small files), avoiding storage fragmentation and the resource consumption of subsequent Compaction.
+
+- **Server-side Batching (Group Commit): Merge Small Transactions, Optimize 
Storage Efficiency**
+
+After enabling Group Commit, the server merges multiple small-batch loads 
within a short time into a single transaction, reducing Edit Log write count 
and MemTable flush frequency. Merged large transactions generate larger Segment 
files (reducing small files), alleviating background Compaction pressure, 
particularly suitable for high-frequency small-batch scenarios (such as 
logging, IoT data writing).
+
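+As a sketch of server-side batching, the session-level switch below enables Group Commit for subsequent small INSERTs; the table name is a placeholder, and Group Commit can also be enabled per Stream Load request through an HTTP header.
+
+```sql
+-- async_mode returns once data is written to the WAL; sync_mode waits
+-- until the merged transaction is visible (off_mode disables merging).
+SET group_commit = async_mode;
+
+-- High-frequency small INSERTs issued afterwards are merged on the
+-- server into fewer, larger transactions.
+INSERT INTO demo_db.user_event_log (event_time, user_id, event_type, detail)
+VALUES ('2025-08-28 10:00:00', 1001, 'click', '{"page":"home"}');
+```
+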
+### **Bucket Count Optimization: Balance Load and Distribution Efficiency**
+
+Bucket count determines Tablet count (each bucket corresponds to one Tablet), 
directly affecting data distribution on BE nodes. Too few buckets easily cause 
data skew (single BE overloaded), while too many buckets increase metadata 
management and distribution overhead (BE needs to handle more Tablets' MemTable 
and Segment files).
+
+- **Reasonable Bucket Count Configuration: Ensure Balanced Tablet Size**
+
+The bucket count should be set according to the number of BE nodes and the data volume, with a recommended compressed size of 1-10GB per Tablet (roughly, bucket count = total compressed data volume / 1-10GB). Also choose the bucket key carefully (for example, a random-number column) to avoid data skew. Reasonable bucketing balances load across BE nodes, avoiding single-node overload or wasted resources on other nodes, and improves parallel write efficiency.
+
+- **Random Bucketing Optimization: Reduce RPC Overhead and Compaction 
Pressure**
+
+In random bucketing scenarios, enabling `load_to_single_tablet=true` can write 
data directly to a single Tablet, bypassing distribution to multiple Tablets. 
This eliminates CPU overhead for computing Tablet distribution and RPC 
transmission overhead between BEs, significantly improving write speed. 
Simultaneously, concentrated writing to a single Tablet reduces small Segment 
file generation, avoids frequent Compaction-induced write amplification, 
reduces BE resource consumption and stora [...]
+
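+A rough sizing sketch under the 1-10GB-per-Tablet guideline: for about 320 GB of compressed data, a target of roughly 5 GB per Tablet gives about 64 buckets. The two placeholder definitions below show hash bucketing sized that way and a random-bucketing table for which `load_to_single_tablet=true` would be supplied with the load job itself (for example as a Stream Load header).
+
+```sql
+-- Hash bucketing sized for ~320 GB compressed / ~5 GB per Tablet = about 64 buckets.
+CREATE TABLE demo_db.orders
+(
+    order_id   BIGINT NOT NULL,
+    order_date DATE   NOT NULL,
+    amount     DECIMAL(15, 2)
+)
+DUPLICATE KEY(order_id)
+DISTRIBUTED BY HASH(order_id) BUCKETS 64;
+
+-- Random bucketing for append-only data with no natural bucket key;
+-- pair it with load_to_single_tablet=true on the load job.
+CREATE TABLE demo_db.raw_log
+(
+    log_time DATETIME NOT NULL,
+    message  STRING
+)
+DUPLICATE KEY(log_time)
+DISTRIBUTED BY RANDOM BUCKETS 16;
+```
+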
+### **Memory Optimization: Reduce Flushing and Resource Impact**
+
+During data loading, BE first writes data to memory MemTable (default 200MB), 
then asynchronously flushes to disk to generate Segment files (triggering disk 
I/O) when full. High-frequency flushing increases disk or object storage 
(storage-compute separation scenarios) I/O pressure; insufficient memory causes 
MemTable dispersion (in multi-partition/bucket scenarios), easily triggering 
frequent flushing or OOM.
+
+- **Sequential Load by Partition: Concentrate Memory Usage**
+
+Loading partition by partition (for example, day by day) concentrates writes on a single partition, reducing MemTable dispersion (writing to many partitions requires a separate MemTable for each) and flush frequency, which lowers memory fragmentation and I/O pressure.
+
+- **Large-scale Data Batch Load: Reduce Resource Impact**
+
+For large-file or multi-file loads (such as Broker Load), it is recommended to split the data into batches (≤100GB per batch) to limit the cost of retrying a failed load and to reduce concentrated use of BE memory and disk. Large local files can be split and loaded automatically with the `streamloader` tool.
+
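+A minimal sketch of loading one partition's worth of data at a time, assuming a hypothetical staging table; each statement is its own transaction, keeping MemTable usage concentrated on a single day's partition.
+
+```sql
+-- Backfill one day at a time instead of the whole history in one job
+-- (hypothetical staging table; repeat per day or script the loop).
+INSERT INTO demo_db.user_event_log
+SELECT event_time, user_id, event_type, detail
+FROM demo_db.user_event_log_staging
+WHERE event_time >= '2025-08-27 00:00:00'
+  AND event_time <  '2025-08-28 00:00:00';
+```
+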
+### **Concurrency Optimization: Balance Throughput and Resource Competition**
+
+Doris's distributed architecture supports multi-BE parallel writing. 
Increasing concurrency can improve throughput, but excessive concurrency causes 
CPU, memory, or object storage QPS competition (storage-compute separation 
scenarios need to consider QPS limits of APIs like S3), increasing transaction 
conflicts and latency.
+
+- **Reasonable Concurrency Control: Match Hardware Resources**
+
+Set concurrent threads based on BE node count and hardware resources (CPU, 
memory, disk I/O). Moderate concurrency can fully utilize BE parallel 
processing capabilities, improving throughput; excessive concurrency reduces 
efficiency due to resource competition.
+
+- **Low Latency Scenarios: Reduce Concurrency and Asynchronous Submission**
+
+For low-latency scenarios (such as real-time monitoring), reduce the concurrency (to avoid resource contention) and use Group Commit's asynchronous mode (`async_mode`) to merge small transactions and reduce transaction commit latency.
+
+## Doris Data Load Latency and Throughput Trade-offs
+
+When using Apache Doris, data load **Latency** and **Throughput** often need 
to be balanced in actual business scenarios:
+
+- **Lower Latency**: Means users can see the latest data faster, but smaller 
write batches and higher write frequency lead to more frequent background 
Compaction, consuming more CPU, IO, and memory resources while increasing 
metadata management pressure.
+- **Higher Throughput**: Reduces load count by increasing single load data 
volume, which can significantly reduce metadata pressure and background 
Compaction overhead, thus improving overall system performance. However, 
latency between data writing and visibility will increase.
+
+Therefore, it's recommended that users **maximize single load data volume** 
while meeting business latency requirements to improve throughput and reduce 
system overhead.
+
+### Test Data
+
+#### Flink End-to-End Latency
+
+The tests write data through the Flink Connector in batching mode and focus on end-to-end latency and load throughput. The batching interval is controlled by the Flink Connector's sink.buffer-flush.interval parameter. For detailed usage of the Flink Connector, refer to [Flink-Doris-Connector](../../../ecosystem/flink-doris-connector#usage).
+
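+For reference, a Flink SQL sink definition of the kind used in such a test might look as follows; the connector option names are taken from the Flink Doris Connector documentation but should be verified against the connector version in use, and the addresses and credentials are placeholders.
+
+```sql
+-- Flink SQL sink table writing TPCH lineitem-like rows into Doris;
+-- sink.buffer-flush.interval is the batch time varied in the tests below.
+CREATE TABLE doris_sink (
+    l_orderkey      BIGINT,
+    l_partkey       BIGINT,
+    l_suppkey       BIGINT,
+    l_quantity      DECIMAL(15, 2),
+    l_extendedprice DECIMAL(15, 2)
+) WITH (
+    'connector' = 'doris',
+    'fenodes' = 'fe_host:8030',
+    'table.identifier' = 'tpch.lineitem',
+    'username' = 'root',
+    'password' = '',
+    'sink.label-prefix' = 'flink_lineitem',
+    'sink.enable.batch-mode' = 'true',
+    'sink.buffer-flush.interval' = '10s'
+);
+```
+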
+**Machine Configuration:**
+
+- 1 FE: 8-core CPU, 16GB memory
+- 3 BEs: 16-core CPU, 64GB memory
+
+**Dataset:**
+
+- TPCH lineitem data
+
+Test results for load performance under different batch times and concurrency levels:
+
+| Batch Time (s) | Load Concurrency | Bucket Count | Throughput (rows/s) | End-to-End Avg Latency (s) | End-to-End P99 Latency (s) |
+| -------------- | ---------------- | ------------ | ------------------- | -------------------------- | -------------------------- |
+| 0.2            | 1                | 32           | 6073                | 0.211                      | 0.517                      |
+| 1              | 1                | 32           | 31586               | 0.71                       | 1.39                       |
+| 10             | 1                | 32           | 67437               | 5.65                       | 10.90                      |
+| 20             | 1                | 32           | 93769               | 10.962                     | 20.682                     |
+| 60             | 1                | 32           | 125000              | 32.46                      | 62.17                      |
+| 0.2            | 10               | 32           | 9300                | 0.38                       | 0.704                      |
+| 1              | 10               | 32           | 34633               | 0.75                       | 1.47                       |
+| 10             | 10               | 32           | 82023               | 5.44                       | 10.43                      |
+| 20             | 10               | 32           | 139731              | 11.12                      | 22.68                      |
+| 60             | 10               | 32           | 171642              | 32.37                      | 61.93                      |
+
+Test results for the impact of different bucket counts on load performance:
+
+| Batch Time (s) | Load Concurrency | Bucket Count | Throughput (rows/s) | End-to-End Avg Latency (s) | End-to-End P99 Latency (s) |
+| -------------- | ---------------- | ------------ | ------------------- | -------------------------- | -------------------------- |
+| 1              | 10               | 4            | 34722               | 0.86                       | 2.28                       |
+| 1              | 10               | 16           | 34526               | 0.8                        | 1.52                       |
+| 1              | 10               | 32           | 34633               | 0.75                       | 1.47                       |
+| 1              | 10               | 64           | 34829               | 0.81                       | 1.51                       |
+| 1              | 10               | 128          | 34722               | 0.83                       | 1.55                       |
+
+#### GroupCommit Testing
+
+For Group Commit performance test data, refer to [Group Commit Performance](../group-commit-manual.md#performance).
+
+## Summary
+
+Apache Doris's data load mechanism relies on distributed collaboration between 
FE and BE, combined with transaction management and lightweight ETL 
functionality, ensuring efficient and reliable data writing. Frequent 
small-batch loads increase transaction overhead, storage fragmentation, and 
Compaction pressure. The following optimization strategies can effectively 
alleviate these issues:
+
+- **Table Structure Design**: Reasonable partitioning and detail models reduce scanning and computational overhead, and streamlined indexes lower the write burden.
+- **Batching Optimization**: Client-side and server-side batching reduce transaction and flush frequency and produce larger files, improving storage and query efficiency.
+- **Bucket Count Optimization**: Appropriate bucketing balances load, avoiding hotspots and excessive management overhead.
+- **Memory Optimization**: Control MemTable usage and load partition by partition.
+- **Concurrency Optimization**: Moderate concurrency improves throughput; combine it with batching and resource monitoring to keep latency under control.
+
+Users can combine these strategies according to business scenarios (such as 
real-time logging, batch ETL), optimize table design, parameter configuration, 
and resource allocation to significantly improve load performance.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current.json b/i18n/zh-CN/docusaurus-plugin-content-docs/current.json
index 85252898a68..d68e0134005 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current.json
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current.json
@@ -123,6 +123,10 @@
     "message": "复杂数据类型",
     "description": "The label for category Complex Data Types in sidebar docs"
   },
+  "sidebar.docs.category.Load Internals": {
+    "message": "导入原理",
+    "description": "The label for category Load Internals in sidebar docs"
+  },
   "sidebar.docs.category.Updating Data": {
     "message": "数据更新",
     "description": "The label for category Update in sidebar docs"
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-internals/load-internas.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-internals/load-internas.md
new file mode 100644
index 00000000000..67b4620b708
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-internals/load-internas.md
@@ -0,0 +1,211 @@
+---
+{
+
+    "title": "导入原理与性能调优",
+    "language": "zh-CN"
+
+}
+---
+
+
+## 概述
+
+Apache Doris 是一个高性能的分布式分析型数据库,采用 MPP(大规模并行处理)架构,广泛应用于实时数据分析、数据仓库和流计算等场景。数据导入是 
Doris 的核心功能,直接影响数据分析的实时性和准确性。高效的导入机制能够确保大规模数据快速、可靠地进入系统,为后续查询提供支持。本文将剖析 Doris 
数据导入的通用原理,涵盖关键流程、组件、事务管理等,探讨影响导入性能的因素,并提供实用的优化方法和最佳实践,有助于用户选择合适的导入策略,优化导入性能。
+
+## 数据导入原理
+
+### 导入原理概述
+
+Doris 的数据导入原理建立在其分布式架构之上,主要涉及前端节点(Frontend, FE)和后端节点(Backend, BE)。FE 
负责元数据管理、查询解析、任务调度和事务协调,而 BE 则处理实际的数据存储、计算和写入操作。Doris 
的数据导入设计旨在满足多样化的业务需求,包括实时写入、流式同步、批量加载和外部数据源集成。其核心理念包括:
+
+- **一致性与原子性**:每个导入任务作为一个事务,确保数据原子写入,避免部分写入。通过 Label 机制保证导入数据的不丢不重。
+- **灵活性**:支持多种数据源(如本地文件、HDFS、S3、Kafka 等)和格式(如 CSV、JSON、Parquet、ORC 等),满足不同场景。
+- **高效性**:利用分布式架构并行处理数据,多 BE 节点并行处理数据,提高吞吐量。
+- **简易性**:提供轻量级 ETL 功能,用户可在导入时直接进行数据清洗和转换,减少外部工具依赖。
+- **灵活建模**:支持明细模型(Duplicate Key)、主键模型(Unique Key)和聚合模型(Aggregate 
Key),允许在导入时进行数据聚合或去重。
+
+### 导入通用流程
+
+Doris 的数据导入过程可以分为以下几个直观的步骤,无论使用何种导入方式(如 Stream Load、Broker Load、Routine Load 
等),核心流程基本一致。
+
+1. **提交导入任务**
+   1. 用户通过客户端(如 HTTP、JDBC、MySQL 客户端)提交导入请求,指定数据源(如本地文件、Kafka Topic、HDFS 
文件路径)、目标表、文件格式和导入参数(如分隔符、错误容忍度)。
+   2. 每个任务可以指定一个唯一的 **Label**,用于标识任务并支持幂等性(防止重复导入)。例如,用户在 Stream Load 中通过 HTTP 
header 指定 Label。
+   3. Doris 的前端节点(FE)接收请求,验证权限、检查目标表是否存在,并解析导入参数。
+2. **任务分配与协调**
+   1. FE 分析数据分布(基于表的分区和分桶规则),生成导入计划,并选择一个后端节点(BE)作为 **Coordinator**,负责协调整个任务。
+   2. 如果用户直接向 BE 提交(如 Stream Load),BE 可直接担任 Coordinator,但仍需从 FE 获取元数据(如表 
Schema)。
+   3. 导入计划会将数据分配到多个 BE 节点,确保并行处理以提高效率。
+3. **数据读取与分发**
+   1. Coordinator BE 从数据源读取数据(例如,从 Kafka 拉取消息、从 S3 读取文件,或直接接收 HTTP 数据流)。
+   2. Doris 解析数据格式(如对 CSV 分割、JSON 解析),并支持用户定义的 **轻量 ETL** 操作,包括:
+      - **前置过滤**:对原始数据进行过滤,减少处理开销。
+      - **列映射**:调整数据列与目标表列的对应关系。
+      - **数据转换**:通过表达式处理数据。
+      - **后置过滤**:对转换后的数据进行过滤。
+   3. Coordinator BE 解析完数据后按分区和分桶规则分发到多个下游的 Executor BE。
+4. **数据写入**
+   1. 数据分发到多个 BE 节点,写入内存表(MemTable),按 Key 列进行排序。对于 Aggregate 或 Unique Key 
模型,Doris 会根据 Key 进行聚合或去重(如 SUM、REPLACE)。
+   2. 当 MemTable 写满(默认 200MB)或任务结束时,数据异步写入磁盘,形成列式存储的 **Segment 文件**,并组成 
**Rowset**。
+   3. 每个 BE 独立处理分配的数据,写入完成后向 Coordinator 报告状态。
+5. **事务提交与发布**
+   1. Coordinator 向 FE 发起事务提交(Commit)。FE 确保多数副本成功写入后,通知 BE 发布数据版本(Publish 
Version),等 BE Publish 成功后,FE 标记事务为 **VISIBLE**,此时数据可查询。
+   2. 如果失败,FE 触发回滚(Rollback),删除临时数据,确保数据一致性。
+6. **结果返回**
+   1. 同步方式(如 Stream Load、Insert Into)直接返回导入结果,包含成功/失败状态和错误详情(如 ErrorURL)。
+   2. 异步方式(如 Broker Load)提供任务 ID 和 Label,用户可通过 SHOW LOAD 查看进度、错误行数和详细信息。
+   3. 操作记录到审计日志,支持后续追溯。
+
+### Memtable 前移
+
+Memtable 前移是 Apache Doris 2.1.0 版本引入的优化机制,针对 INSERT INTO…SELECT 
导入方式显著提升性能,官方测试显示在单副本场景下导入耗时降低至 36%,三副本场景下降低至 54%,整体性能提升超 100%。传统流程中,Sink 
节点需将数据编码为 Block 格式,通过 Ping-pong RPC 传输到下游节点,涉及多次编码和解码,增加开销。Memtable 
前移优化了这一过程:Sink 节点直接处理 MemTable,生成 Segment 数据后通过 Streaming RPC 
传输,减少编码解码和传输等待,同时提供更准确的内存反压。目前该功能只支持存算一体部署模式。
+
+### 存算分离导入
+
+在存算分离架构下,导入优化聚焦数据存储和事务管理解耦:
+
+- **数据存储**:BE 不持久化数据,MemTable flush 后生成 Segment 文件直接上传至共享存储(如 
S3、HDFS),利用对象存储的高可用性和低成本支持弹性扩展。BE 本地 File Cache 异步缓存热点数据,通过 TTL 和 Warmup 
策略提升查询命中率。元数据(如 Tablet、Rowset 元数据)由 Meta Service 存储于 FoundationDB,而非 BE 本地 
RocksDB。
+- **事务处理**:事务管理从 FE 迁移至 Meta Service,消除 FE Edit Log 写入瓶颈。Meta Service 
通过标准接口(beginTransaction、commitTransaction)管理事务,依赖 FoundationDB 的全局事务能力确保一致性。BE 
协调者直接与 Meta Service 交互,记录事务状态,通过原子操作处理冲突和超时回收,简化同步逻辑,提升高并发小批量导入吞吐量。
+
+### 导入方式
+
+Doris 提供多种导入方式,共享上述原理,但针对不同场景优化。用户可根据数据源和业务需求选择:
+
+- **Stream Load**: 通过 HTTP 导入本地文件或数据流,同步返回结果,适合实时写入(如应用程序推送数据)。
+- **Broker Load**: 通过 SQL 导入 HDFS、S3 等外部存储,异步执行,适合大规模批量导入。
+- **Routine Load**: 从 Kafka 持续消费数据,异步流式导入,支持 Exactly-Once,适合实时同步消息队列数据。
+- **Insert Into/Select**: 通过 SQL 从 Doris 表或外部源(如 Hive、MySQL、S3 TVF)导入,适合 ETL 
作业、外部数据集成。
+- **MySQL Load**: 兼容 MySQL LOAD DATA 语法,导入本地 CSV 文件,数据经 FE 转发为 Stream 
Load,适合小规模测试或 MySQL 用户迁移。
+
+## 如何提升 Doris 导入性能
+
+Doris 的导入性能受其分布式架构与存储机制影响,核心涉及 FE 元数据管理、BE 并行处理、MemTable 
缓存刷盘及事务管理等环节。以下从表结构设计、攒批策略、分桶配置、内存管理和并发控制等维度,结合导入原理说明优化策略及有效性。
+
+### **表结构设计优化:降低分发开销与内存压力**
+
+Doris 的导入流程中,数据需经 FE 解析后,按表的分区和分桶规则分发至 BE 节点的 Tablet(数据分片),并在 BE 内存中通过 
MemTable 缓存、排序后刷盘生成 Segment 文件。表结构(分区、模型、索引)直接影响数据分发效率、计算负载和存储碎片。
+
+- **分区设计:隔离数据范围,减少分发与内存压力**
+
+通过按业务查询模式(如时间、区域)划分分区,导入时数据仅分发至目标分区,避免处理无关分区的元数据和文件。同时写入多个分区会导致大量 Tablet 活跃,每个 
Tablet 占用独立的 MemTable,显著增加 BE内存压力,可能触发提前 Flush,生成大量小 Segment 文件。这不仅增加磁盘或对象存储的 
I/O 开销,还因小文件引发频繁 Compaction 和写放大,降低性能。通过限制活跃分区数量(如逐天导入),可减少同时活跃的 Tablet 
数,缓解内存紧张,生成更大的 Segment 文件,降低 Compaction 负担,从而提升并行写入效率和后续查询性能。
+
+- **模型选择:减少计算负载,加速写入**
+
+明细模型(Duplicate Key)仅存储原始数据,无需聚合或去重计算;而 Aggregate 模型需按 Key 列聚合,Unique Key 
模型需去重,均会增加 CPU 和内存消耗。对于无需去重或聚合的场景,优先使用明细模型,可避免 BE 节点在 MemTable 
阶段的额外计算(如排序、去重),降低内存占用和 CPU 压力,加速数据写入流程。
+
+- **索引控制:平衡查询与写入开销**
+
+索引(如位图索引、倒排索引)需在导入时同步更新,增加写入时的维护成本。仅为高频查询字段创建索引,避免冗余索引,可减少 BE 
写入时的索引更新操作(如索引构建、校验),降低 CPU 和内存占用,提升导入吞吐量。
+
+### **攒批优化:减少事务与存储碎片**
+
+Doris 的每个导入任务为独立事务,涉及 FE 的 Edit Log 写入(记录元数据变更)和 BE 的 MemTable 刷盘(生成 Segment 
文件)。高频小批量导入(如 KB 级别)会导致 Edit Log 频繁写入(增加 FE 磁盘 I/O)、MemTable 频繁刷盘(生成大量小 Segment 
文件,触发 Compaction 写放大),显著降低性能。
+
+- **客户端攒批:减少事务次数,降低元数据开销**
+
+客户端将数据攒至数百 MB 到数 GB 后一次性导入,减少事务次数。单次大事务替代多次小事务,可降低 FE 的 Edit Log 
写入频率(减少元数据操作)及 BE 的 MemTable 刷盘次数(减少小文件生成),避免存储碎片和后续 Compaction 的资源消耗。
+
+- **服务端攒批(Group Commit):合并小事务,优化存储效率**
+
+开启 Group Commit 后,服务端将短时间内的多个小批量导入合并为单一事务,减少 Edit Log 写入次数和 MemTable 
刷盘频率。合并后的大事务生成更大的 Segment 文件(减少小文件),减轻后台 Compaction 压力,特别适用于高频小批量场景(如日志、IoT 
数据写入)。
+
+### **分桶数优化:平衡负载与分发效率**
+
+分桶数决定 Tablet数量(每个桶对应一个 Tablet),直接影响数据在 BE 节点的分布。过少分桶易导致数据倾斜(单 BE 
负载过高),过多分桶会增加元数据管理和分发开销(BE 需处理更多 Tablet 的 MemTable 和 Segment 文件)。
+
+- **合理配置分桶数:确保 Tablet 大小均衡**
+
+分桶数需根据 BE 节点数量和数据量设置,推荐单 Tablet 压缩后的数据大小为 
1-10GB(计算公式:分桶数=总数据量/(1-10GB))。同时,调整分桶键(如随机数列)避免数据倾斜。合理分桶可平衡 BE 
节点负载,避免单节点过载或多节点资源浪费,提升并行写入效率。
+
+- **随机分桶优化:减少 RPC 开销与 Compaction 压力**
+
+在随机分桶场景中,启用 `load_to_single_tablet=true`,可将数据直接写入单一 Tablet,绕过分发到多个 Tablet 
的过程。这消除了计算 Tablet 分布的 CPU 开销和 BE 间的 RPC 传输开销,显著提升写入速度。同时,集中写入单一 Tablet 减少了小 
Segment 文件的生成,避免频繁 Compaction 带来的写放大,降低 BE 的资源消耗和存储碎片,提升导入和查询效率。
+
+### **内存优化:减少刷盘与资源冲击**
+
+数据导入时,BE 先将数据写入内存的 MemTable(默认 200MB),写满后异步刷盘生成 Segment 文件(触发磁盘 
I/O)。高频刷盘会增加磁盘或对象存储(存算分离场景)的 I/O 压力;内存不足则导致 MemTable 分散(多分区/分桶时),易触发频繁刷盘或 OOM。
+
+- **按分区顺序导入:集中内存使用**
+
+按分区顺序(如逐天)导入,集中数据写入单一分区,减少 MemTable 分散(多分区需为每个分区分配 MemTable)和刷盘次数,降低内存碎片和 I/O 
压力。
+
+- **大规模数据分批导入:降低资源冲击**
+
+对大文件或多文件导入(如 Broker Load),建议分批(每批≤100GB),避免导入出错后重试代价过大,同时减少对 BE 
内存和磁盘的集中占用。本地大文件可使用 `streamloader` 工具自动分批导入。
+
+### **并发优化:平衡吞吐量与资源竞争**
+
+Doris 的分布式架构支持多 BE 并行写入,增加并发可提升吞吐量,但过高并发会导致 CPU、内存或对象存储 QPS 争抢(存算分离场景需考虑 S3 等 
API 的 QPS 限制),增加事务冲突和延迟。
+
+- **合理控制并发:匹配硬件资源**
+
+结合 BE 节点数和硬件资源(CPU、内存、磁盘 I/O)设置并发线程。适度并发可充分利用 BE 并行处理能力,提升吞吐量;过高并发则因资源争抢降低效率。
+
+- **低时延场景:降低并发与异步提交**
+
+对低时延要求场景(如实时监控),需降低并发数(避免资源竞争),并结合 Group Commit 
的异步模式(`async_mode`)合并小事务,减少事务提交延迟。
+
+## Doris 数据导入的延迟与吞吐取舍
+
+在使用 Apache Doris 时,数据导入的 **延迟(Latency)** 与 **吞吐量(Throughput)** 
往往需要在实际业务场景中进行平衡:
+
+- **更低延迟**:意味着用户能更快看到最新数据,但写入批次更小,写入频率更高,会导致后台 Compaction 更频繁,占用更多 CPU、IO 
和内存资源,同时增加元数据管理的压力。
+- **更高吞吐**:则通过增大单次导入数据量来减少导入次数,可以显著降低元数据压力和后台 Compaction 
开销,从而提升系统整体性能。但数据写入到可见之间的延迟会有所增加。
+
+因此,建议用户在满足业务时延要求的前提下,**尽量增大单次导入写入的数据量**,以提升吞吐并减少系统开销。
+
+### 测试数据
+
+#### Flink 端到端时延
+
+采用 Flink Connector 使用攒批模式进行写入,主要关注数据端到端的时延和导入吞吐。攒批时间通过 Flink Connector 的 
sink.buffer-flush.interval 参数来控制,Flink Connector 的详细使用参考 
[Flink-Doris-Connector](../../../ecosystem/flink-doris-connector#使用说明)。
+
+**机器配置:**
+
+- 1 台 FE: 8 核 CPU、16GB 内存
+- 3 台 BE:16 核 CPU、64GB 内存
+
+**数据集:**
+
+- TPCH lineitem 数据
+
+不同攒批时间和不同并发下的导入性能,测试结果:
+
+| 攒批时间(s) | 导入并发 | bucket数 | 吞吐(rows/s) | 端到端平均时延(s) | 端到端P99时延(s) |
+| ------------- | -------- | -------- | -------------- | ------------------- | ------------------ |
+| 0.2           | 1        | 32       | 6073           | 0.211               | 0.517              |
+| 1             | 1        | 32       | 31586          | 0.71                | 1.39               |
+| 10            | 1        | 32       | 67437          | 5.65                | 10.90              |
+| 20            | 1        | 32       | 93769          | 10.962              | 20.682             |
+| 60            | 1        | 32       | 125000         | 32.46               | 62.17              |
+| 0.2           | 10       | 32       | 9300           | 0.38                | 0.704              |
+| 1             | 10       | 32       | 34633          | 0.75                | 1.47               |
+| 10            | 10       | 32       | 82023          | 5.44                | 10.43              |
+| 20            | 10       | 32       | 139731         | 11.12               | 22.68              |
+| 60            | 10       | 32       | 171642         | 32.37               | 61.93              |
+
+不同 bucket 数对导入性能的影响,测试结果:
+
+| 攒批时间(s) | 导入并发 | bucket数 | 吞吐(rows/s) | 端到端平均时延(s) | 端到端P99时延(s) |
+| ------------- | -------- | -------- | -------------- | ------------------- | ------------------ |
+| 1             | 10       | 4        | 34722          | 0.86                | 2.28               |
+| 1             | 10       | 16       | 34526          | 0.8                 | 1.52               |
+| 1             | 10       | 32       | 34633          | 0.75                | 1.47               |
+| 1             | 10       | 64       | 34829          | 0.81                | 1.51               |
+| 1             | 10       | 128      | 34722          | 0.83                | 1.55               |
+
+#### GroupCommit 测试
+
+Group Commit 性能测试数据参考 [Group Commit 性能](../group-commit-manual.md#性能)
+
+## 总结
+
+Apache Doris 的数据导入机制依托 FE 和 BE 的分布式协作,结合事务管理和轻量 ETL 
功能,确保高效、可靠的数据写入。频繁小批量导入会增加事务开销、存储碎片和 Compaction 压力,通过以下优化策略可有效缓解:
+
+- **表结构设计**:合理分区和明细模型减少扫描和计算开销,精简索引降低写入负担。
+- **攒批优化**:客户端和服务端攒批减少事务和 flush 频率,生成大文件,优化存储和查询。
+- **分桶数优化**:适量分桶平衡负载,避免热点或管理开销。
+- **内存优化**:控制 MemTable 大小、按分区导入。
+- **并发优化**:适度并发提升吞吐量,结合分批和资源监控控制延迟。
+
+用户可根据业务场景(如实时日志、批量 ETL)结合这些策略,优化表设计、参数配置和资源分配,显著提升导入性能。
\ No newline at end of file
diff --git a/sidebars.json b/sidebars.json
index 193625bd5a2..8a597b3c663 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -214,7 +214,14 @@
                         "data-operate/import/load-data-convert",
                         "data-operate/import/load-high-availability",
                         "data-operate/import/group-commit-manual",
-                        "data-operate/import/load-best-practices"
+                        "data-operate/import/load-best-practices",
+                        {
+                            "type": "category",
+                            "label": "Load Internals",
+                            "items": [
+                                "data-operate/import/load-internals/load-internals"
+                            ]
+                        }
                     ]
                 },
                 {


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

