This is an automated email from the ASF dual-hosted git repository.
kassiez pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 9e633f0c9ca [doc](update) Address comment on update overview and
re-translate english doc by LLM (#1687)
9e633f0c9ca is described below
commit 9e633f0c9caaa82a160d63f852d7750ebff119d6
Author: zhannngchen <[email protected]>
AuthorDate: Thu Jan 2 14:18:03 2025 +0800
[doc](update) Address comment on update overview and re-translate english
doc by LLM (#1687)
## Versions
- [ ] dev
- [ ] 3.0
- [ ] 2.1
- [ ] 2.0
## Languages
- [ ] Chinese
- [ ] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
docs/data-operate/update/update-overview.md | 92 ++++++++++-----------
.../current/data-operate/update/update-overview.md | 20 ++---
.../data-operate/update/update-overview.md | 16 ++--
.../data-operate/update/update-overview.md | 20 ++---
.../data-operate/update/update-overview.md | 93 +++++++++++-----------
.../data-operate/update/update-overview.md | 93 +++++++++++-----------
6 files changed, 166 insertions(+), 168 deletions(-)
diff --git a/docs/data-operate/update/update-overview.md
b/docs/data-operate/update/update-overview.md
index 1ae91031919..17c8b37938e 100644
--- a/docs/data-operate/update/update-overview.md
+++ b/docs/data-operate/update/update-overview.md
@@ -23,39 +23,39 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
-Data update refers to modifying the value columns in data records with the
same key. The handling of data updates varies for different data models:
+Data update refers to modifying the value columns in data records with the
same key. The approach to handling data updates varies depending on the data
model:
-- **Primary Key (Unique) Model**: The primary key model is specifically
designed for data updates. Doris supports two storage implementations:
Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance,
while MoW provides better analytical performance. From Doris version 2.1, the
default storage method is MoW. The primary key model supports using the
`UPDATE` statement for small data updates and also supports batch updates
through data loading. Loading methods include [...]
+- **Primary Key (Unique) Model**: This model is designed specifically for data
updates. Doris supports two storage implementations: Merge-on-Read (MoR) and
Merge-on-Write (MoW). MoR optimizes write performance, while MoW enhances
analytical performance. From Doris version 2.1, MoW is the default storage
implementation. The primary key model supports the `UPDATE` statement for small
data updates and batch updates through data loading methods such as Stream
Load, Broker Load, Routine Load, [...]
-- **Aggregate Model**: In the aggregate model, data update is a special use
case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL,
data updates can be achieved. The aggregate model only supports updates based
on data loading and does not support using the `UPDATE` statement. By setting
the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability
can be achieved.
+- **Aggregate Model**: In this model, data updates are a special use case.
When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data
updates can be performed. The aggregate model supports updates only through
data loading and does not support the `UPDATE` statement. By setting the
aggregate function to REPLACE_IF_NOT_NULL, partial column updates can be
achieved.
-By understanding the data update methods of different models, you can better
choose the appropriate update strategy to meet specific business needs.
+Understanding the data update methods for different models helps in selecting
the appropriate update strategy to meet specific business needs.
## Comparison of Update Capabilities for Different Models/Implementations
### Performance Comparison
-| | Unique Key MoW
| Unique Key MoR
[...]
-|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------
[...]
-| Import Speed | Deduplication is performed during import. Small-batch
real-time writes incur approximately 10%-20% performance loss compared to MoR,
while large-batch imports (e.g., tens or hundreds of millions of records) have
about 30%-50% performance loss compared to MoR.
| Similar to Duplicate Key
[...]
-| Query Speed | Similar to Duplicate Key
| Requires deduplication during queries, with query time approximately 3-10
times that of M [...]
-| Predicate Pushdown | Supported
| Not Supported
[...]
-| Resource Consumption | - **Import Resource Consumption**: Consumes
approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br />
- **Query Resource Consumption**: Similar to Duplicate Key with no additional
resource consumption.<br /> - **Compaction Resource Consumption**: Higher
memory and CPU usage compared to Duplicate Key, specific usage depends on data
characteristics and volume. | - **Import Resource Consumption**: Similar to
Duplicate Key with no additional reso [...]
+| | Unique Key MoW
| Unique Key MoR
| Aggregate Key |
+|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| Loading Speed | Deduplication during data loading, small batch real-time
writes have about 10%-20% performance loss compared to MoR, large batch data
loading (e.g., tens of millions/billions of data) have about 30%-50%
performance loss compared to MoR | Similar to Duplicate Key
| Similar to Duplicate Key
|
+| Query Speed | Similar to Duplicate Key
| Deduplication during query, query time
is about 3-10 times that of MoW
| If the aggregate function is REPLACE/REPLACE_IF_NOT_NULL,
query speed is si [...]
+| Predicate Pushdown | Supported
| Not supported
| Not supported
|
+| Resource Consumption | - **Loading Resource Consumption**: Compared to
Duplicate Key/Unique Key MoR, about 10%-30% additional CPU consumption.<br
/><br /> - **Query Resource Consumption**: Similar to Duplicate Key, no
additional resource consumption.<br /><br /> - **Compaction Resource
Consumption**: Consumes more memory and CPU compared to Duplicate Key,
depending on data characteristics and volume. | - **Loading Resource
Consumption**: Similar to Duplicate Key, no additional resource [...]
-### Feature Support Comparison
-| | Unique Key MoW | Unique Key MoR | Aggregate Key
|
-|----------------|----------------|----------------|------------------------------------|
-| UPDATE | Supported | Supported | Not Supported
|
-| DELETE | Supported | Supported | Not Supported
|
-| sequence column| Supported | Supported | Not Supported
|
-| delete_sign | Supported | Supported | Not Supported
|
-| Partial Column Updates | Supported | Not Supported | Supported (can't update
null value) |
-| Inverted Index | Supported | Not Supported | Not Supported
|
+### Supported Feature Comparison
+| | Unique Key MoW | Unique Key MoR | Aggregate Key |
+|----------------|----------------|----------------|----------------|
+| UPDATE | Supported | Supported | Not supported |
+| DELETE | Supported | Supported | Not supported |
+| Sequence Column| Supported | Supported | Not supported |
+| Delete Sign | Supported | Supported | Not supported |
+| Partial Column Update | Supported | Not supported | Supported (but
cannot update null values) |
+| Inverted Index | Supported | Not supported | Not supported |
-## Update in Primary Key (Unique) Model
+## Updates in the Primary Key (Unique) Model
-Starting from Doris 2.0, Doris primary key (unique) model supports both
Merge-on-Read (MoR) and Merge-on-Write (MoW) storage modes. MoR is optimized
for write operations, while MoW is optimized for faster analysis performance.
In actual tests, the analysis performance of MoW storage can be 5-10 times
faster than MoR.
+Starting from Doris 2.0, the primary key (unique) model introduces the
Merge-on-Write (MoW) storage implementation in addition to the original
Merge-on-Read (MoR). MoR is optimized for write performance, while MoW is
optimized for faster analytical performance. In practical tests, the analytical
performance of typical tables using the MoW storage implementation can be 5-10
times that of the MoR implementation.
-By default, in Doris 2.0, the unique key model is still based on MoR. To
create a MoW model, you need to manually specify the parameter
"enable_unique_key_merge_on_write" as "true". Here's an example:
+In Doris 2.0, the default unique model created is still MoR. To create a MoW
table, you need to manually specify it with the parameter
"enable_unique_key_merge_on_write" = "true", as shown in the example below:
```sql
CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write
@@ -78,53 +78,53 @@ PROPERTIES (
```
:::caution
-Starting from Doris 2.1, MoW is the default mode for the unique key model. So,
if you are using Doris 2.1 or a higher version, make sure to read the relevant
table creation documentation.
+From Doris version 2.1, MoW is the default implementation for the primary key
model. Therefore, if using Doris version 2.1 and above, be sure to read the
relevant table creation documentation.
:::
-### Two Update Methods in Unique Key Model
+### Two Update Methods for the Primary Key Model
-#### `UPDATE` statement
+#### Updating with the `UPDATE` Statement
-Whether it is MoR or MoW, the semantics ensure that the specified columns are
updated. The time taken for a single UPDATE operation increases with the amount
of data being updated.
+Whether using MoR or MoW, the semantics are to complete the update of the
specified columns. The time taken for a single UPDATE increases with the amount
of data being updated.
-#### Batch update based on load
+#### Update during Data Loading
-Doris supports multiple data load methods, including Stream Load, Broker Load,
Routine Load, and Insert Into. For primary key tables, all load operations
default to "UPSERT" semantics: if a record with the same key does not exist, it
is inserted; if it already exists, it is updated. There are two types of
updates: full row updates and partial column updates.
+Doris supports multiple data loading methods, including Stream Load, Broker
Load, Routine Load, and Insert Into. For primary key tables, all loading
operations use the "UPSERT" semantics by default: if the record with the same
primary key does not exist, an insert operation is performed; if the record
exists, an update operation is performed. Update methods include whole row
updates and partial column updates:
-- **Full Row Update**: Updates for Unique Key tables default to full row
updates. During data loading, users can choose to provide either all fields or
only part of them. If only partial fields are provided, Doris will fill in the
missing fields with default values to form a complete record for updating.
+- **Whole Row Update**: The default update method for Unique Key tables is
whole row update. When loading data, users can choose to provide all fields or
only some fields. If users only provide some fields, Doris will fill in the
missing fields with default values, generate a complete record, and perform the
update.
-- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or by specifying
`partial_columns:true` in the HTTP Header. When enabled, if a record with the
given primary key exists, only the specified fields are updated; if the primary
key does not exist, the missing fields are filled with default values.
+- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or specifying `partial_columns:true`
in the HTTP Header. Once enabled, if the primary key of the loaded data already
exists, only the specified fields will be updated; if the primary key does not
exist, the missing fields will be filled with default values.
-We will provide detailed explanations of these two update methods in the
documentation: [Update in Unique Key Model](../update/unique-update) and [Load
Update in Unique Key Model](../update/update-of-unique-model).
+We will introduce the two update methods in detail in the documents [Update of
Primary Key Model](../update/unique-update) and [Load Update of Primary Key
Model](../update/update-of-unique-model).
-### Concurrency Control for Primary Key Model Updates
+### Concurrency Control for Updates in the Primary Key Model
-#### Using `UPDATE` Statements to Update Data
+#### Updating Data with the `UPDATE` Statement
-By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locking.
+By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locks.
Users can adjust the concurrency limit by modifying the FE configuration
`enable_concurrent_update=true`. When the concurrency limit is relaxed, if
multiple `UPDATE` statements update the same row of data, the result will be
undefined.
-#### Batch Updates Based on Load
+#### Batch Updates Based on Data Loading
-Doris provides atomicity for all load update operations—each data load will
either be fully applied or fully rolled back.
+Doris provides atomicity guarantees for all loading update operations, meaning
that each load job is either fully applied or fully rolled back.
-For concurrent load updates, Doris determines the order of concurrent updates
using an internal version control system (assigned based on the order of
completed loading), using an MVCC mechanism.
+For concurrent load updates, Doris uses internal version control (assigned in
the order of load completion) and the MVCC mechanism to determine the order of
concurrent updates.
-Since the commit order of multiple concurrent load updates may be
unpredictable, if these concurrent load jobs involve updates to the same
primary key, the order in which they take effect is also uncertain. As a
result, the final visible outcome may be indeterminate. To address this issue,
Doris provides a `sequence` column mechanism, allowing users to specify a
version for each row in concurrent load updates, thus ensuring determinism in
the outcome of concurrent updates.
+Since the submission order of multiple concurrent loading may be
unpredictable, if these concurrent loading involve updates to the same primary
key, the effective order will also be unpredictable, resulting in uncertainty
in the final visible result. To solve this problem, Doris provides a sequence
column mechanism, allowing users to specify a version for each row of data
during concurrent load updates to clearly control the order of concurrent
updates and achieve determinism.
-For more detailed information on concurrency control, refer to the
documentation on [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
+We will introduce the concurrency control mechanism for updates in detail in
the document [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
-## Update in Aggregate Model
+## Updates in the Aggregate Model
-The update in the aggregate model refers to the process of generating new
aggregate values by combining new column values with existing aggregate values,
according to the requirements of the aggregate functions.
+Updates in the aggregate model mainly refer to producing new aggregate values
using new column values and old aggregate values according to the requirements
of the aggregate function.
New Agg Value = Agg Func (Old Agg Value, New Column Value)
-The update in the aggregate model is only supported through load methods and
does not support the use of Update statements. When defining a table in the
aggregate model, if the aggregation function for the value column is defined as
REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities
similar to the unique key model. For more details, please refer to the
documentation on [Load Update in the Aggregate
Model](../update/update-of-aggregate-model).
+The aggregate model only supports updates based on load methods and does not
support updates using the `UPDATE` statement. When defining an aggregate model
table, if the aggregate function of the value column is defined as
REPLACE_IF_NOT_NULL, it can indirectly achieve the ability to update partial
columns similar to the primary key table. For more information, please refer to
[Load Update of Aggregate Model](../update/update-of-aggregate-model).
## Recommendations for Choosing Between Primary Key and Aggregate Models
-- For most scenarios that require data updates, it is recommended to **prefer
the primary key model**. Examples include synchronizing from TP databases to
Doris via CDC, user profiling, and audience targeting.
-- The following scenarios are recommended to use the aggregate model:
- 1. Some fields need to be aggregated as metrics, while others need to be
updated.
- 2. Scenarios where partial column updates are needed, while being very
sensitive to write performance and having low requirements for query latency,
it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL
aggregate function.
\ No newline at end of file
+- For most scenarios with data update requirements, it is recommended to
**prefer the primary key model**. For example, synchronizing from TP databases
to Doris, user profiling, audience selection, etc.
+- The following two scenarios are recommended to use the aggregate model:
+ - Some fields need to be aggregated as metrics, and some fields need to be
updated.
+ - For scenarios with partial column update requirements, high sensitivity to
write performance, and low query latency requirements, it is recommended to use
the aggregate table + REPLACE_IF_NOT_NULL aggregate function.
\ No newline at end of file
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md
index eec0acac953..6d6f215d268 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md
@@ -26,7 +26,7 @@ under the License.
数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同:
-- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into
等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
+- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into 等,所有导入操作都遵循 “UPSERT”
语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或
REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为
REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。
@@ -40,7 +40,7 @@ under the License.
| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比
MoR 约有 30%-50% 的性能损失
| 与 Duplicate Key 接近
| 与 Duplicate
Key 接近 |
| 查询速度 | 与 Duplicate Key 接近
| 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍
| 如果聚合函数为
REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 |
| 谓词下推 | 支持
| 不支持
| 不支持 |
-| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction
资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate
Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的
CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和
CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
+| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /><br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /><br /> -
**Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | -
**导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /><br /> - **查询资源消耗**:相比 Duplicate
Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /><br /> - **Compaction 资源消耗**:相比
Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
### 功能支持对比
| | Unique Key MoW | Unique Key MoR | Aggregate Key |
@@ -49,7 +49,7 @@ under the License.
| DELETE | 支持 | 支持 | 不支持 |
| sequence 列 | 支持 | 支持 | 不支持 |
| delete_sign | 支持 | 支持 | 不支持 |
-| 部分列更新 | 支持 | 不支持 | 支持(但无法更新 null 值) |
+| 部分列更新 | 支持 | 不支持 | 支持 (但无法更新 null 值) |
| 倒排索引 | 支持 | 不支持 | 不支持 |
## 主键(Unique)模型的更新
@@ -62,7 +62,7 @@ Doris 主键 (unique) 模型,从 Doris 2.0 开始,除了原来的 Merge-on-R
CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write
(
`user_id` LARGEINT NOT NULL,
- `username` VARCHAR(50) NOT NULL ,
+ `username` VARCHAR(50) NOT NULL,
`city` VARCHAR(20),
`age` SMALLINT,
`sex` TINYINT,
@@ -79,7 +79,7 @@ PROPERTIES (
```
:::caution
-从 Doris 2.1 版本开始,写时合并是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
+从 Doris 2.1 版本开始,MoW是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
:::
### 主键模型的两种更新方式
@@ -92,7 +92,7 @@ PROPERTIES (
Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routine Load 以及 Insert Into
等。对于主键表,所有导入操作默认采用“UPSERT”语义:当相同主键的记录不存在时执行插入操作,若记录已存在则进行更新操作。更新方式包括整行更新和部分列更新:
-- **整行更新**:Unique Key
表的更新默认为整行更新。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
+- **整行更新**:Unique Key
表的更新**默认为整行更新**。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
- **部分列更新**:Unique Key MoW 支持部分列更新。用户可以通过设置会话变量
`enable_unique_key_partial_update = true` 或在 HTTP Header 中指定
`partial_columns:true` 来启用此功能。开启后,若导入数据的主键已存在,则仅更新指定的部分字段;若主键不存在,则使用默认值填充缺失字段。
@@ -120,12 +120,12 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据
聚合模型的更新,主要是指用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。
-New Agg Value = Agg Func ( Old Agg Value, New Column Value)
+New Agg Value = Agg Func (Old Agg Value, New Column Value)
-聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为
REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看
[聚合模型的导入更新](../update/update-of-aggregate-model)。
+聚合模型只支持基于导入方式的更新,不支持使用 `UPDATE` 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为
REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看
[聚合模型的导入更新](../update/update-of-aggregate-model)。
## 主键模型和聚合模型的选择建议
- 大部分有数据更新需求的场景,都建议**首选主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。
- 下面两类场景,建议使用聚合模型:
- 1. 部分字段需要做指标聚合,部分字段需要进行更新。
- 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。
\ No newline at end of file
+ - 部分字段需要做指标聚合,部分字段需要进行更新。
+ - 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。
\ No newline at end of file
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md
index 768a6a78472..d5dab0c6172 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md
@@ -26,7 +26,7 @@ under the License.
数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同:
-- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into
等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
+- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into 等,所有导入操作都遵循 “UPSERT”
语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或
REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为
REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。
@@ -40,7 +40,7 @@ under the License.
| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比
MoR 约有 30%-50% 的性能损失
| 与 Duplicate Key 接近
| 与 Duplicate
Key 接近 |
| 查询速度 | 与 Duplicate Key 接近
| 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍
| 如果聚合函数为
REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 |
| 谓词下推 | 支持
| 不支持
| 不支持 |
-| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction
资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate
Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的
CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和
CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
+| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /><br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /><br /> -
**Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | -
**导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /><br /> - **查询资源消耗**:相比 Duplicate
Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /><br /> - **Compaction 资源消耗**:相比
Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
### 功能支持对比
| | Unique Key MoW | Unique Key MoR | Aggregate Key |
@@ -79,7 +79,7 @@ PROPERTIES (
```
:::caution
-从 Doris 2.1 版本开始,写时合并是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
+从 Doris 2.1 版本开始,MoW是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
:::
### 主键模型的两种更新方式
@@ -92,7 +92,7 @@ PROPERTIES (
Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routine Load 以及 Insert Into
等。对于主键表,所有导入操作默认采用“UPSERT”语义:当相同主键的记录不存在时执行插入操作,若记录已存在则进行更新操作。更新方式包括整行更新和部分列更新:
-- **整行更新**:Unique Key
表的更新默认为整行更新。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
+- **整行更新**:Unique Key
表的更新**默认为整行更新**。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
- **部分列更新**:Unique Key MoW 支持部分列更新。用户可以通过设置会话变量
`enable_unique_key_partial_update = true` 或在 HTTP Header 中指定
`partial_columns:true` 来启用此功能。开启后,若导入数据的主键已存在,则仅更新指定的部分字段;若主键不存在,则使用默认值填充缺失字段。
@@ -114,7 +114,7 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据
由于多个并发导入更新的提交顺序可能无法预期,若这些并发导入涉及相同主键的更新,则其生效顺序也无法预知,最终的可见结果会因此存在不确定性。为解决此问题,Doris
提供了 sequence 列机制,允许用户在并发导入更新时为每一行数据指定版本,以便明确控制并发更新的结果顺序,实现确定性。
-我们将在文档 [主键模型的更新并发控制](../update/unique-update-concurrent-control.md)
中对事务机制进行详细介绍。
+我们将在文档 [主键模型的更新并发控制](../update/unique-update-concurrent-control.md)
中对更新的并发控制机制进行详细介绍。
## 聚合(Aggregate)模型的更新
@@ -125,7 +125,7 @@ New Agg Value = Agg Func (Old Agg Value, New Column Value)
聚合模型只支持基于导入方式的更新,不支持使用 `UPDATE` 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为
REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看
[聚合模型的导入更新](../update/update-of-aggregate-model)。
## 主键模型和聚合模型的选择建议
-- 大部分有数据更新需求的场景,都建议首选**主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。
+- 大部分有数据更新需求的场景,都建议**首选主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。
- 下面两类场景,建议使用聚合模型:
- 1. 部分字段需要做指标聚合,部分字段需要进行更新。
- 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。
\ No newline at end of file
+ - 部分字段需要做指标聚合,部分字段需要进行更新。
+ - 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。
\ No newline at end of file
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md
index 0f01e1771d9..937a19e7bd6 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md
@@ -26,7 +26,7 @@ under the License.
数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同:
-- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into
等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
+- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和
Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为
MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker
Load、Routine Load 和 Insert Into 等,所有导入操作都遵循 “UPSERT”
语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。
- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或
REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为
REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。
@@ -40,7 +40,7 @@ under the License.
| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比
MoR 约有 30%-50% 的性能损失
| 与 Duplicate Key 接近
| 与 Duplicate
Key 接近 |
| 查询速度 | 与 Duplicate Key 接近
| 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍
| 如果聚合函数为
REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 |
| 谓词下推 | 支持
| 不支持
| 不支持 |
-| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction
资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate
Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的
CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和
CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
+| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的
CPU。<br /><br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /><br /> -
**Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | -
**导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /><br /> - **查询资源消耗**:相比 Duplicate
Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /><br /> - **Compaction 资源消耗**:相比
Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 |
### 功能支持对比
| | Unique Key MoW | Unique Key MoR | Aggregate Key |
@@ -62,7 +62,7 @@ Doris 主键(unique)模型,从 Doris 2.0 开始,除了原来的 Merge-on
CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write
(
`user_id` LARGEINT NOT NULL,
- `username` VARCHAR(50) NOT NULL ,
+ `username` VARCHAR(50) NOT NULL,
`city` VARCHAR(20),
`age` SMALLINT,
`sex` TINYINT,
@@ -79,20 +79,20 @@ PROPERTIES (
```
:::caution
-从 Doris 2.1 版本开始,写时合并是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
+从 Doris 2.1 版本开始,MoW是主键模型的默认方式。所以如果使用 Doris 2.1 及以上版本,务必要阅读相关建表文档。
:::
### 主键模型的两种更新方式
#### 使用 `UPDATE` 语句更新
-无论是 MoR 还是 MoW,语义都是完成对指定列的更新。这个适合少量数据,不频繁的更新。
+无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次 UPDATE 的耗时会随着被更新的数据量的增加而增长。
#### 基于导入的批量更新
Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routine Load 以及 Insert Into
等。对于主键表,所有导入操作默认采用“UPSERT”语义:当相同主键的记录不存在时执行插入操作,若记录已存在则进行更新操作。更新方式包括整行更新和部分列更新:
-- **整行更新**:Unique Key
表的更新默认为整行更新。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
+- **整行更新**:Unique Key
表的更新**默认为整行更新**。在导入数据时,用户可以选择提供所有字段,或仅提供部分字段。当用户只提供部分字段时,Doris
会用默认值填充缺失的字段,生成完整记录并进行更新。
- **部分列更新**:Unique Key MoW 支持部分列更新。用户可以通过设置会话变量
`enable_unique_key_partial_update = true` 或在 HTTP Header 中指定
`partial_columns:true` 来启用此功能。开启后,若导入数据的主键已存在,则仅更新指定的部分字段;若主键不存在,则使用默认值填充缺失字段。
@@ -122,10 +122,10 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据
New Agg Value = Agg Func (Old Agg Value, New Column Value)
-聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为
REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看
[聚合模型的导入更新](../update/update-of-aggregate-model)。
+聚合模型只支持基于导入方式的更新,不支持使用 `UPDATE` 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为
REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看
[聚合模型的导入更新](../update/update-of-aggregate-model)。
## 主键模型和聚合模型的选择建议
-- 大部分有数据更新需求的场景,都建议首选**主键模型**。例如从TP数据库CDC同步到Doris,用户画像,人群圈选等。
+- 大部分有数据更新需求的场景,都建议**首选主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。
- 下面两类场景,建议使用聚合模型:
- 1. 部分字段需要做指标聚合,部分字段需要进行更新。
- 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL聚合函数。
\ No newline at end of file
+ - 部分字段需要做指标聚合,部分字段需要进行更新。
+ - 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。
\ No newline at end of file
diff --git a/versioned_docs/version-2.1/data-operate/update/update-overview.md
b/versioned_docs/version-2.1/data-operate/update/update-overview.md
index 0d6579c83e6..17c8b37938e 100644
--- a/versioned_docs/version-2.1/data-operate/update/update-overview.md
+++ b/versioned_docs/version-2.1/data-operate/update/update-overview.md
@@ -23,40 +23,39 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+Data update refers to modifying the value columns in data records with the
same key. The approach to handling data updates varies depending on the data
model:
-Data update refers to modifying the value columns in data records with the
same key. The handling of data updates varies for different data models:
+- **Primary Key (Unique) Model**: This model is designed specifically for data
updates. Doris supports two storage implementations: Merge-on-Read (MoR) and
Merge-on-Write (MoW). MoR optimizes write performance, while MoW enhances
analytical performance. From Doris version 2.1, MoW is the default storage
implementation. The primary key model supports the `UPDATE` statement for small
data updates and batch updates through data loading methods such as Stream
Load, Broker Load, Routine Load, [...]
-- **Primary Key (Unique) Model**: The primary key model is specifically
designed for data updates. Doris supports two storage implementations:
Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance,
while MoW provides better analytical performance. From Doris version 2.1, the
default storage method is MoW. The primary key model supports using the
`UPDATE` statement for small data updates and also supports batch updates
through data loading. Loading methods include [...]
+- **Aggregate Model**: In this model, data updates are a special use case.
When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data
updates can be performed. The aggregate model supports updates only through
data loading and does not support the `UPDATE` statement. By setting the
aggregate function to REPLACE_IF_NOT_NULL, partial column updates can be
achieved.
-- **Aggregate Model**: In the aggregate model, data update is a special use
case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL,
data updates can be achieved. The aggregate model only supports updates based
on data loading and does not support using the `UPDATE` statement. By setting
the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability
can be achieved.
-
-By understanding the data update methods of different models, you can better
choose the appropriate update strategy to meet specific business needs.
+Understanding the data update methods for different models helps in selecting
the appropriate update strategy to meet specific business needs.
## Comparison of Update Capabilities for Different Models/Implementations
### Performance Comparison
-| | Unique Key MoW
| Unique Key MoR
[...]
-|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------
[...]
-| Import Speed | Deduplication is performed during import. Small-batch
real-time writes incur approximately 10%-20% performance loss compared to MoR,
while large-batch imports (e.g., tens or hundreds of millions of records) have
about 30%-50% performance loss compared to MoR.
| Similar to Duplicate Key
[...]
-| Query Speed | Similar to Duplicate Key
| Requires deduplication during queries, with query time approximately 3-10
times that of M [...]
-| Predicate Pushdown | Supported
| Not Supported
[...]
-| Resource Consumption | - **Import Resource Consumption**: Consumes
approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br />
- **Query Resource Consumption**: Similar to Duplicate Key with no additional
resource consumption.<br /> - **Compaction Resource Consumption**: Higher
memory and CPU usage compared to Duplicate Key, specific usage depends on data
characteristics and volume. | - **Import Resource Consumption**: Similar to
Duplicate Key with no additional reso [...]
+| | Unique Key MoW
| Unique Key MoR
| Aggregate Key |
+|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| Loading Speed | Deduplication during data loading, small batch real-time
writes have about 10%-20% performance loss compared to MoR, large batch data
loading (e.g., tens of millions/billions of data) have about 30%-50%
performance loss compared to MoR | Similar to Duplicate Key
| Similar to Duplicate Key
|
+| Query Speed | Similar to Duplicate Key
| Deduplication during query, query time
is about 3-10 times that of MoW
| If the aggregate function is REPLACE/REPLACE_IF_NOT_NULL,
query speed is si [...]
+| Predicate Pushdown | Supported
| Not supported
| Not supported
|
+| Resource Consumption | - **Loading Resource Consumption**: Compared to
Duplicate Key/Unique Key MoR, about 10%-30% additional CPU consumption.<br
/><br /> - **Query Resource Consumption**: Similar to Duplicate Key, no
additional resource consumption.<br /><br /> - **Compaction Resource
Consumption**: Consumes more memory and CPU compared to Duplicate Key,
depending on data characteristics and volume. | - **Loading Resource
Consumption**: Similar to Duplicate Key, no additional resource [...]
-### Feature Support Comparison
-| | Unique Key MoW | Unique Key MoR | Aggregate Key
|
-|----------------|----------------|----------------|------------------------------------|
-| UPDATE | Supported | Supported | Not Supported
|
-| DELETE | Supported | Supported | Not Supported
|
-| sequence column| Supported | Supported | Not Supported
|
-| delete_sign | Supported | Supported | Not Supported
|
-| Partial Column Updates | Supported | Not Supported | Supported (can't update
null value) |
-| Inverted Index | Supported | Not Supported | Not Supported
|
+### Supported Feature Comparison
+| | Unique Key MoW | Unique Key MoR | Aggregate Key |
+|----------------|----------------|----------------|----------------|
+| UPDATE | Supported | Supported | Not supported |
+| DELETE | Supported | Supported | Not supported |
+| Sequence Column| Supported | Supported | Not supported |
+| Delete Sign | Supported | Supported | Not supported |
+| Partial Column Update | Supported | Not supported | Supported (but
cannot update null values) |
+| Inverted Index | Supported | Not supported | Not supported |
-## Update in Primary Key (Unique) Model
+## Updates in the Primary Key (Unique) Model
-Starting from Doris 2.0, Doris primary key (unique) model supports both
Merge-on-Read (MoR) and Merge-on-Write (MoW) storage modes. MoR is optimized
for write operations, while MoW is optimized for faster analysis performance.
In actual tests, the analysis performance of MoW storage can be 5-10 times
faster than MoR.
+Starting from Doris 2.0, the primary key (unique) model introduces the
Merge-on-Write (MoW) storage implementation in addition to the original
Merge-on-Read (MoR). MoR is optimized for write performance, while MoW is
optimized for faster analytical performance. In practical tests, the analytical
performance of typical tables using the MoW storage implementation can be 5-10
times that of the MoR implementation.
-By default, in Doris 2.0, the unique key model is still based on MoR. To
create a MoW model, you need to manually specify the parameter
"enable_unique_key_merge_on_write" as "true". Here's an example:
+In Doris 2.0, the default unique model created is still MoR. To create a MoW
table, you need to manually specify it with the parameter
"enable_unique_key_merge_on_write" = "true", as shown in the example below:
```sql
CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write
@@ -79,53 +78,53 @@ PROPERTIES (
```
:::caution
-Starting from Doris 2.1, MoW is the default mode for the unique key model. So,
if you are using Doris 2.1 or higher version, make sure to read the relevant
table creation documentation.
+From Doris version 2.1, MoW is the default implementation for the primary key
model. Therefore, if using Doris version 2.1 and above, be sure to read the
relevant table creation documentation.
:::
-### Two Update Methods in Unique Key Model
+### Two Update Methods for the Primary Key Model
-#### `UPDATE` statement
+#### Updating with the `UPDATE` Statement
-Whether it is MoR or MoW, the semantics ensure that the specified columns are
updated. The time taken for a single UPDATE operation increases with the amount
of data being updated.
+Whether using MoR or MoW, the semantics are to complete the update of the
specified columns. The time taken for a single UPDATE increases with the amount
of data being updated.
-#### Batch update based on load
+#### Update during Data Loading
-Doris supports multiple data load methods, including Stream Load, Broker Load,
Routine Load, and Insert Into. For primary key tables, all load operations
default to "UPSERT" semantics: if a record with the same key does not exist, it
is inserted; if it already exists, it is updated. There are two types of
updates: full row updates and partial column updates.
+Doris supports multiple data loading methods, including Stream Load, Broker
Load, Routine Load, and Insert Into. For primary key tables, all loading
operations use the "UPSERT" semantics by default: if the record with the same
primary key does not exist, an insert operation is performed; if the record
exists, an update operation is performed. Update methods include whole row
updates and partial column updates:
-- **Full Row Update**: Updates for Unique Key tables default to full row
updates. During data loading, users can choose to provide either all fields or
only part of them. If only partial fields are provided, Doris will fill in the
missing fields with default values to form a complete record for updating.
+- **Whole Row Update**: The default update method for Unique Key tables is
whole row update. When loading data, users can choose to provide all fields or
only some fields. If users only provide some fields, Doris will fill in the
missing fields with default values, generate a complete record, and perform the
update.
-- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or by specifying
`partial_columns:true` in the HTTP Header. When enabled, if a record with the
given primary key exists, only the specified fields are updated; if the primary
key does not exist, the missing fields are filled with default values.
+- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or specifying `partial_columns:true`
in the HTTP Header. Once enabled, if the primary key of the loaded data already
exists, only the specified fields will be updated; if the primary key does not
exist, the missing fields will be filled with default values.
-We will provide detailed explanations of these two update methods in the
documentation: [Update in Unique Key Model](../update/unique-update) and [Load
Update in Unique Key Model](../update/update-of-unique-model).
+We will introduce the two update methods in detail in the documents [Update of
Primary Key Model](../update/unique-update) and [Load Update of Primary Key
Model](../update/update-of-unique-model).
-### Concurrency Control for Primary Key Model Updates
+### Concurrency Control for Updates in the Primary Key Model
-#### Using `UPDATE` Statements to Update Data
+#### Updating Data with the `UPDATE` Statement
-By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locking.
+By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locks.
Users can adjust the concurrency limit by modifying the FE configuration
`enable_concurrent_update=true`. When the concurrency limit is relaxed, if
multiple `UPDATE` statements update the same row of data, the result will be
undefined.
-#### Batch Updates Based on Load
+#### Batch Updates Based on Data Loading
-Doris provides atomicity for all load update operations—each data load will
either be fully applied or fully rolled back.
+Doris provides atomicity guarantees for all loading update operations, meaning
that each load job is either fully applied or fully rolled back.
-For concurrent load updates, Doris determines the order of concurrent updates
using an internal version control system (assigned based on the order of
completed loading), using an MVCC mechanism.
+For concurrent load updates, Doris uses internal version control (assigned in
the order of load completion) and the MVCC mechanism to determine the order of
concurrent updates.
-Since the commit order of multiple concurrent load updates may be
unpredictable, if these concurrent load jobs involve updates to the same
primary key, the order in which they take effect is also uncertain. As a
result, the final visible outcome may be indeterminate. To address this issue,
Doris provides a `sequence` column mechanism, allowing users to specify a
version for each row in concurrent load updates, thus ensuring determinism in
the outcome of concurrent updates.
+Since the submission order of multiple concurrent loading may be
unpredictable, if these concurrent loading involve updates to the same primary
key, the effective order will also be unpredictable, resulting in uncertainty
in the final visible result. To solve this problem, Doris provides a sequence
column mechanism, allowing users to specify a version for each row of data
during concurrent load updates to clearly control the order of concurrent
updates and achieve determinism.
-For more detailed information on concurrency control, refer to the
documentation on [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
+We will introduce the concurrency control mechanism for updates in detail in
the document [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
-## Update in Aggregate Model
+## Updates in the Aggregate Model
-The update in the aggregate model refers to the process of generating new
aggregate values by combining new column values with existing aggregate values,
according to the requirements of the aggregate functions.
+Updates in the aggregate model mainly refer to producing new aggregate values
using new column values and old aggregate values according to the requirements
of the aggregate function.
New Agg Value = Agg Func (Old Agg Value, New Column Value)
-The update in the aggregate model is only supported through load methods and
does not support the use of Update statements. When defining a table in the
aggregate model, if the aggregation function for the value column is defined as
REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities
similar to the unique key model. For more details, please refer to the
documentation on [Load Update in the Aggregate
Model](../update/update-of-aggregate-model).
+The aggregate model only supports updates based on load methods and does not
support updates using the `UPDATE` statement. When defining an aggregate model
table, if the aggregate function of the value column is defined as
REPLACE_IF_NOT_NULL, it can indirectly achieve the ability to update partial
columns similar to the primary key table. For more information, please refer to
[Load Update of Aggregate Model](../update/update-of-aggregate-model).
## Recommendations for Choosing Between Primary Key and Aggregate Models
-- For most scenarios that require data updates, it is recommended to **prefer
the primary key model**. Examples include synchronizing from TP databases to
Doris via CDC, user profiling, and audience targeting.
-- The following scenarios are recommended to use the aggregate model:
- 1. Some fields need to be aggregated as metrics, while others need to be
updated.
- 2. Scenarios where partial column updates are needed, while being very
sensitive to write performance and having low requirements for query latency,
it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL
aggregate function.
\ No newline at end of file
+- For most scenarios with data update requirements, it is recommended to
**prefer the primary key model**. For example, synchronizing from TP databases
to Doris, user profiling, audience selection, etc.
+- The following two scenarios are recommended to use the aggregate model:
+ - Some fields need to be aggregated as metrics, and some fields need to be
updated.
+ - For scenarios with partial column update requirements, high sensitivity to
write performance, and low query latency requirements, it is recommended to use
the aggregate table + REPLACE_IF_NOT_NULL aggregate function.
\ No newline at end of file
diff --git a/versioned_docs/version-3.0/data-operate/update/update-overview.md
b/versioned_docs/version-3.0/data-operate/update/update-overview.md
index 2096efcedb8..17c8b37938e 100644
--- a/versioned_docs/version-3.0/data-operate/update/update-overview.md
+++ b/versioned_docs/version-3.0/data-operate/update/update-overview.md
@@ -23,40 +23,39 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
+Data update refers to modifying the value columns in data records with the
same key. The approach to handling data updates varies depending on the data
model:
-Data update refers to modifying the value columns in data records with the
same key. The handling of data updates varies for different data models:
+- **Primary Key (Unique) Model**: This model is designed specifically for data
updates. Doris supports two storage implementations: Merge-on-Read (MoR) and
Merge-on-Write (MoW). MoR optimizes write performance, while MoW enhances
analytical performance. From Doris version 2.1, MoW is the default storage
implementation. The primary key model supports the `UPDATE` statement for small
data updates and batch updates through data loading methods such as Stream
Load, Broker Load, Routine Load, [...]
-- **Primary Key (Unique) Model**: The primary key model is specifically
designed for data updates. Doris supports two storage implementations:
Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance,
while MoW provides better analytical performance. From Doris version 2.1, the
default storage method is MoW. The primary key model supports using the
`UPDATE` statement for small data updates and also supports batch updates
through data loading. Loading methods include [...]
+- **Aggregate Model**: In this model, data updates are a special use case.
When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data
updates can be performed. The aggregate model supports updates only through
data loading and does not support the `UPDATE` statement. By setting the
aggregate function to REPLACE_IF_NOT_NULL, partial column updates can be
achieved.
-- **Aggregate Model**: In the aggregate model, data update is a special use
case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL,
data updates can be achieved. The aggregate model only supports updates based
on data loading and does not support using the `UPDATE` statement. By setting
the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability
can be achieved.
-
-By understanding the data update methods of different models, you can better
choose the appropriate update strategy to meet specific business needs.
+Understanding the data update methods for different models helps in selecting
the appropriate update strategy to meet specific business needs.
## Comparison of Update Capabilities for Different Models/Implementations
### Performance Comparison
-| | Unique Key MoW
| Unique Key MoR
[...]
-|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------
[...]
-| Import Speed | Deduplication is performed during import. Small-batch
real-time writes incur approximately 10%-20% performance loss compared to MoR,
while large-batch imports (e.g., tens or hundreds of millions of records) have
about 30%-50% performance loss compared to MoR.
| Similar to Duplicate Key
[...]
-| Query Speed | Similar to Duplicate Key
| Requires deduplication during queries, with query time approximately 3-10
times that of M [...]
-| Predicate Pushdown | Supported
| Not Supported
[...]
-| Resource Consumption | - **Import Resource Consumption**: Consumes
approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br />
- **Query Resource Consumption**: Similar to Duplicate Key with no additional
resource consumption.<br /> - **Compaction Resource Consumption**: Higher
memory and CPU usage compared to Duplicate Key, specific usage depends on data
characteristics and volume. | - **Import Resource Consumption**: Similar to
Duplicate Key with no additional reso [...]
+| | Unique Key MoW
| Unique Key MoR
| Aggregate Key |
+|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| Loading Speed | Deduplication during data loading, small batch real-time
writes have about 10%-20% performance loss compared to MoR, large batch data
loading (e.g., tens of millions/billions of data) have about 30%-50%
performance loss compared to MoR | Similar to Duplicate Key
| Similar to Duplicate Key
|
+| Query Speed | Similar to Duplicate Key
| Deduplication during query, query time
is about 3-10 times that of MoW
| If the aggregate function is REPLACE/REPLACE_IF_NOT_NULL,
query speed is si [...]
+| Predicate Pushdown | Supported
| Not supported
| Not supported
|
+| Resource Consumption | - **Loading Resource Consumption**: Compared to
Duplicate Key/Unique Key MoR, about 10%-30% additional CPU consumption.<br
/><br /> - **Query Resource Consumption**: Similar to Duplicate Key, no
additional resource consumption.<br /><br /> - **Compaction Resource
Consumption**: Consumes more memory and CPU compared to Duplicate Key,
depending on data characteristics and volume. | - **Loading Resource
Consumption**: Similar to Duplicate Key, no additional resource [...]
-### Feature Support Comparison
-| | Unique Key MoW | Unique Key MoR | Aggregate Key
|
-|----------------|----------------|----------------|------------------------------------|
-| UPDATE | Supported | Supported | Not Supported
|
-| DELETE | Supported | Supported | Not Supported
|
-| sequence column| Supported | Supported | Not Supported
|
-| delete_sign | Supported | Supported | Not Supported
|
-| Partial Column Updates | Supported | Not Supported | Supported (can't update
null value) |
-| Inverted Index | Supported | Not Supported | Not Supported
|
+### Supported Feature Comparison
+| | Unique Key MoW | Unique Key MoR | Aggregate Key |
+|----------------|----------------|----------------|----------------|
+| UPDATE | Supported | Supported | Not supported |
+| DELETE | Supported | Supported | Not supported |
+| Sequence Column| Supported | Supported | Not supported |
+| Delete Sign | Supported | Supported | Not supported |
+| Partial Column Update | Supported | Not supported | Supported (but
cannot update null values) |
+| Inverted Index | Supported | Not supported | Not supported |
-## Update in Primary Key (Unique) Model
+## Updates in the Primary Key (Unique) Model
-Starting from Doris 2.0, Doris primary key (unique) model supports both
Merge-on-Read (MoR) and Merge-on-Write (MoW) storage modes. MoR is optimized
for write operations, while MoW is optimized for faster analysis performance.
In actual tests, the analysis performance of MoW storage can be 5-10 times
faster than MoR.
+Starting from Doris 2.0, the primary key (unique) model introduces the
Merge-on-Write (MoW) storage implementation in addition to the original
Merge-on-Read (MoR). MoR is optimized for write performance, while MoW is
optimized for faster analytical performance. In practical tests, the analytical
performance of typical tables using the MoW storage implementation can be 5-10
times that of the MoR implementation.
-By default, in Doris 2.0, the unique key model is still based on MoR. To
create a MoW model, you need to manually specify the parameter
"enable_unique_key_merge_on_write" as "true". Here's an example:
+In Doris 2.0, the default unique model created is still MoR. To create a MoW
table, you need to manually specify it with the parameter
"enable_unique_key_merge_on_write" = "true", as shown in the example below:
```sql
CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write
@@ -79,53 +78,53 @@ PROPERTIES (
```
:::caution
-Starting from Doris 2.1, MoW is the default mode for the unique key model. So,
if you are using Doris 2.1 or a higher version, make sure to read the relevant
table creation documentation.
+From Doris version 2.1, MoW is the default implementation for the primary key
model. Therefore, if using Doris version 2.1 and above, be sure to read the
relevant table creation documentation.
:::
-### Two Update Methods in Unique Key Model
+### Two Update Methods for the Primary Key Model
-#### `UPDATE` statement
+#### Updating with the `UPDATE` Statement
-Whether it is MoR or MoW, the semantics ensure that the specified columns are
updated. The time taken for a single UPDATE operation increases with the amount
of data being updated.
+Whether using MoR or MoW, the semantics are to complete the update of the
specified columns. The time taken for a single UPDATE increases with the amount
of data being updated.
-#### Batch update based on load
+#### Update during Data Loading
-Doris supports multiple data load methods, including Stream Load, Broker Load,
Routine Load, and Insert Into. For primary key tables, all load operations
default to "UPSERT" semantics: if a record with the same key does not exist, it
is inserted; if it already exists, it is updated. There are two types of
updates: full row updates and partial column updates.
+Doris supports multiple data loading methods, including Stream Load, Broker
Load, Routine Load, and Insert Into. For primary key tables, all loading
operations use the "UPSERT" semantics by default: if the record with the same
primary key does not exist, an insert operation is performed; if the record
exists, an update operation is performed. Update methods include whole row
updates and partial column updates:
-- **Full Row Update**: Updates for Unique Key tables default to full row
updates. During data loading, users can choose to provide either all fields or
only part of them. If only partial fields are provided, Doris will fill in the
missing fields with default values to form a complete record for updating.
+- **Whole Row Update**: The default update method for Unique Key tables is
whole row update. When loading data, users can choose to provide all fields or
only some fields. If users only provide some fields, Doris will fill in the
missing fields with default values, generate a complete record, and perform the
update.
-- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or by specifying
`partial_columns:true` in the HTTP Header. When enabled, if a record with the
given primary key exists, only the specified fields are updated; if the primary
key does not exist, the missing fields are filled with default values.
+- **Partial Column Update**: Unique Key MoW supports partial column updates.
Users can enable this feature by setting the session variable
`enable_unique_key_partial_update = true` or specifying `partial_columns:true`
in the HTTP Header. Once enabled, if the primary key of the loaded data already
exists, only the specified fields will be updated; if the primary key does not
exist, the missing fields will be filled with default values.
-We will provide detailed explanations of these two update methods in the
documentation: [Update in Unique Key Model](../update/unique-update) and [Load
Update in Unique Key Model](../update/update-of-unique-model).
+We will introduce the two update methods in detail in the documents [Update of
Primary Key Model](../update/unique-update) and [Load Update of Primary Key
Model](../update/update-of-unique-model).
-### Concurrency Control for Primary Key Model Updates
+### Concurrency Control for Updates in the Primary Key Model
-#### Using `UPDATE` Statements to Update Data
+#### Updating Data with the `UPDATE` Statement
-By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locking.
+By default, Doris does not allow multiple `UPDATE` operations on the same
table at the same time. The `UPDATE` statement ensures isolation through
table-level locks.
Users can adjust the concurrency limit by modifying the FE configuration
`enable_concurrent_update=true`. When the concurrency limit is relaxed, if
multiple `UPDATE` statements update the same row of data, the result will be
undefined.
-#### Batch Updates Based on Load
+#### Batch Updates Based on Data Loading
-Doris provides atomicity for all load update operations—each data load will
either be fully applied or fully rolled back.
+Doris provides atomicity guarantees for all loading update operations, meaning
that each load job is either fully applied or fully rolled back.
-For concurrent load updates, Doris determines the order of concurrent updates
using an internal version control system (assigned based on the order of
completed loading), using an MVCC mechanism.
+For concurrent load updates, Doris uses internal version control (assigned in
the order of load completion) and the MVCC mechanism to determine the order of
concurrent updates.
-Since the commit order of multiple concurrent load updates may be
unpredictable, if these concurrent load jobs involve updates to the same
primary key, the order in which they take effect is also uncertain. As a
result, the final visible outcome may be indeterminate. To address this issue,
Doris provides a `sequence` column mechanism, allowing users to specify a
version for each row in concurrent load updates, thus ensuring determinism in
the outcome of concurrent updates.
+Since the submission order of multiple concurrent loading may be
unpredictable, if these concurrent loading involve updates to the same primary
key, the effective order will also be unpredictable, resulting in uncertainty
in the final visible result. To solve this problem, Doris provides a sequence
column mechanism, allowing users to specify a version for each row of data
during concurrent load updates to clearly control the order of concurrent
updates and achieve determinism.
-For more detailed information on concurrency control, refer to the
documentation on [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
+We will introduce the concurrency control mechanism for updates in detail in
the document [Concurrency Control for Updates in the Primary Key
Model](../update/unique-update-concurrent-control.md).
-## Update in Aggregate Model
+## Updates in the Aggregate Model
-The update in the aggregate model refers to the process of generating new
aggregate values by combining new column values with existing aggregate values,
according to the requirements of the aggregate functions.
+Updates in the aggregate model mainly refer to producing new aggregate values
using new column values and old aggregate values according to the requirements
of the aggregate function.
New Agg Value = Agg Func (Old Agg Value, New Column Value)
-The update in the aggregate model is only supported through load methods and
does not support the use of Update statements. When defining a table in the
aggregate model, if the aggregation function for the value column is defined as
REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities
similar to the unique key model. For more details, please refer to the
documentation on [Load Update in the Aggregate
Model](../update/update-of-aggregate-model).
+The aggregate model only supports updates based on load methods and does not
support updates using the `UPDATE` statement. When defining an aggregate model
table, if the aggregate function of the value column is defined as
REPLACE_IF_NOT_NULL, it can indirectly achieve the ability to update partial
columns similar to the primary key table. For more information, please refer to
[Load Update of Aggregate Model](../update/update-of-aggregate-model).
## Recommendations for Choosing Between Primary Key and Aggregate Models
-- For most scenarios that require data updates, it is recommended to **prefer
the primary key model**. Examples include synchronizing from TP databases to
Doris via CDC, user profiling, and audience targeting.
-- The following scenarios are recommended to use the aggregate model:
- 1. Some fields need to be aggregated as metrics, while others need to be
updated.
- 2. Scenarios where partial column updates are needed, while being very
sensitive to write performance and having low requirements for query latency,
it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL
aggregate function.
\ No newline at end of file
+- For most scenarios with data update requirements, it is recommended to
**prefer the primary key model**. For example, synchronizing from TP databases
to Doris, user profiling, audience selection, etc.
+- The following two scenarios are recommended to use the aggregate model:
+ - Some fields need to be aggregated as metrics, and some fields need to be
updated.
+ - For scenarios with partial column update requirements, high sensitivity to
write performance, and low query latency requirements, it is recommended to use
the aggregate table + REPLACE_IF_NOT_NULL aggregate function.
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]