This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 501bb30c296 [doc][doc] small fixes (#929)
501bb30c296 is described below
commit 501bb30c2964192696f7c6cae1e44e71ef25c6ba
Author: Hu Yanjun <[email protected]>
AuthorDate: Tue Jul 30 18:13:26 2024 +0800
[doc][doc] small fixes (#929)
---
docs/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
docs/get-starting/quick-start/doris-paimon.md | 4 ++--
.../current/get-starting/quick-start/doris-paimon.md | 7 +++----
.../version-2.1/get-starting/quick-start/doris-paimon.md | 6 +++---
.../version-3.0/get-starting/quick-start/doris-paimon.md | 6 +++---
.../version-2.1/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
.../version-2.1/get-starting/quick-start/doris-paimon.md | 4 ++--
.../version-3.0/get-starting/quick-start/doris-hudi.md | 14 +++++++-------
.../version-3.0/get-starting/quick-start/doris-paimon.md | 4 ++--
9 files changed, 36 insertions(+), 37 deletions(-)
diff --git a/docs/get-starting/quick-start/doris-hudi.md b/docs/get-starting/quick-start/doris-hudi.md
index f7426e3a3fc..b640c5e8f92 100644
--- a/docs/get-starting/quick-start/doris-hudi.md
+++ b/docs/get-starting/quick-start/doris-hudi.md
@@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an
- Since version 0.15, Apache Doris has introduced Hive and Iceberg external
tables, exploring the capabilities of combining with Apache Iceberg for data
lakes.
- Starting from version 1.2, Apache Doris officially introduced the
Multi-Catalog feature, enabling automatic metadata mapping and data access for
various data sources, along with numerous performance optimizations for
external data reading and query execution. It now fully possesses the ability
to build a high-speed and user-friendly Lakehouse architecture.
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.
+- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.

@@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables:
- Supports Time Travel
- Supports Incremental Read
-With Apache Doris's high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
+With Apache Doris' high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
- Real-time data analysis and processing: Common scenarios such as real-time
data updates and query analysis in industries like finance, advertising, and
e-commerce require real-time data processing. Hudi enables real-time data
updates and management while ensuring data consistency and reliability. Doris
efficiently handles large-scale data query requests in real-time, meeting the
demands of real-time data analysis and processing effectively when combined.
-- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris's efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
-- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris's Incremental Read
feature enhances this process, significantly improving data processing and
analysis efficiency.
-- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris's Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
+- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris' efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
+- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris' Incremental Read feature
enhances this process, significantly improving data processing and analysis
efficiency.
+- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris' Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
This article will introduce readers to how to quickly set up a test and
demonstration environment for Apache Doris + Apache Hudi in a Docker
environment, and demonstrate various operations to help readers get started
quickly.
@@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where
Data in Apache Hudi can be roughly divided into two categories - baseline data
and incremental data. Baseline data is typically merged Parquet files, while
incremental data refers to data increments generated by INSERT, UPDATE, or
DELETE operations. Baseline data can be read directly, while incremental data
needs to be read through Merge on Read.
-For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris's native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
+For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris' native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
-To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris's Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
+To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris' Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
```
-- COW table is read natively
diff --git a/docs/get-starting/quick-start/doris-paimon.md b/docs/get-starting/quick-start/doris-paimon.md
index cf1956c6e2d..27175dfe180 100644
--- a/docs/get-starting/quick-start/doris-paimon.md
+++ b/docs/get-starting/quick-start/doris-paimon.md
@@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us

-From the test results, it can be seen that Doris's average query performance
on the standard static test set is 3-5 times that of Trino. In the future, we
will optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
+From the test results, it can be seen that Doris' average query performance on
the standard static test set is 3-5 times that of Trino. In the future, we will
optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
## Query Optimization
For baseline data, after introducing the Primary Key Table Read Optimized
feature in Apache Paimon version 0.6, the query engine can directly access the
underlying Parquet/ORC files, significantly improving the reading efficiency of
baseline data. For unmerged incremental data (data increments generated by
INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In
addition, Paimon introduced the Deletion Vector feature in version 0.8, which
further enhances the query engine's [...]
-Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris's EXPLAIN statement.
+Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris' EXPLAIN statement.
```
mysql> explain verbose select * from customer where c_nationkey < 3;
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md
index 68d6fba0c90..bdf306d6fa4 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/quick-start/doris-paimon.md
@@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4;
### 04 数据查询
-如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句:
+如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog
的创建语句:
```
-- 已创建,无需执行
@@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2;

-从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
+从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
## 查询优化
@@ -266,5 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3;
67 rows in set (0.23 sec)
```
-可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的
Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
-
+可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的 `hasDeletionVector` 的属性为
`true`,表示该分片有对应的 Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md
index c46e7531e19..a1db993d470 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/get-starting/quick-start/doris-paimon.md
@@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4;
### 04 数据查询
-如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句:
+如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog
的创建语句:
```
-- 已创建,无需执行
@@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2;

-从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
+从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
## 查询优化
@@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3;
67 rows in set (0.23 sec)
```
-可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的
Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
\ No newline at end of file
+可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的
Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
\ No newline at end of file
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md
index c46e7531e19..a1db993d470 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/get-starting/quick-start/doris-paimon.md
@@ -156,7 +156,7 @@ Flink SQL> select * from customer order by c_custkey limit 4;
### 04 数据查询
-如下所示,Doris 集群中已经创建了名为paimon 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog 的创建语句:
+如下所示,Doris 集群中已经创建了名为 `paimon` 的 Catalog(可通过 SHOW CATALOGS 查看)。以下为该 Catalog
的创建语句:
```
-- 已创建,无需执行
@@ -228,7 +228,7 @@ mysql> select * from customer where c_nationkey=1 limit 2;

-从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3 -5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
+从测试结果可以看到,Doris 在标准静态测试集上的平均查询性能是 Trino 的 3~5 倍。后续我们将针对 Deletion Vector
进行优化,进一步提升真实业务场景下的查询效率。
## 查询优化
@@ -266,4 +266,4 @@ mysql> explain verbose select * from customer where c_nationkey < 3;
67 rows in set (0.23 sec)
```
-可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(paimonNativeReadSplits=4/4)。并且第一个分片的hasDeletionVector的属性为 true,表示该分片有对应的
Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
\ No newline at end of file
+可以看到,对于刚才通过 Flink SQL 更新的表,包含 4 个分片,并且全部分片都可以通过 Native Reader
进行访问(`paimonNativeReadSplits=4/4`)。并且第一个分片的`hasDeletionVector`的属性为`true`,表示该分片有对应的
Deletion Vector,读取时会根据 Deletion Vector 进行数据过滤。
\ No newline at end of file
diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md
index f7426e3a3fc..b640c5e8f92 100644
--- a/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md
+++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-hudi.md
@@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an
- Since version 0.15, Apache Doris has introduced Hive and Iceberg external
tables, exploring the capabilities of combining with Apache Iceberg for data
lakes.
- Starting from version 1.2, Apache Doris officially introduced the
Multi-Catalog feature, enabling automatic metadata mapping and data access for
various data sources, along with numerous performance optimizations for
external data reading and query execution. It now fully possesses the ability
to build a high-speed and user-friendly Lakehouse architecture.
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.
+- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.

@@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables:
- Supports Time Travel
- Supports Incremental Read
-With Apache Doris's high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
+With Apache Doris' high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
- Real-time data analysis and processing: Common scenarios such as real-time
data updates and query analysis in industries like finance, advertising, and
e-commerce require real-time data processing. Hudi enables real-time data
updates and management while ensuring data consistency and reliability. Doris
efficiently handles large-scale data query requests in real-time, meeting the
demands of real-time data analysis and processing effectively when combined.
-- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris's efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
-- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris's Incremental Read
feature enhances this process, significantly improving data processing and
analysis efficiency.
-- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris's Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
+- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris' efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
+- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris' Incremental Read feature
enhances this process, significantly improving data processing and analysis
efficiency.
+- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris' Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
This article will introduce readers to how to quickly set up a test and
demonstration environment for Apache Doris + Apache Hudi in a Docker
environment, and demonstrate various operations to help readers get started
quickly.
@@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of '20240603015058442' where
Data in Apache Hudi can be roughly divided into two categories - baseline data
and incremental data. Baseline data is typically merged Parquet files, while
incremental data refers to data increments generated by INSERT, UPDATE, or
DELETE operations. Baseline data can be read directly, while incremental data
needs to be read through Merge on Read.
-For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris's native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
+For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris' native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
-To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris's Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
+To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris' Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
```
-- COW table is read natively
diff --git a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md
index cf1956c6e2d..27175dfe180 100644
--- a/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md
+++ b/versioned_docs/version-2.1/get-starting/quick-start/doris-paimon.md
@@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in Paimon (0.8) version, us

-From the test results, it can be seen that Doris's average query performance
on the standard static test set is 3-5 times that of Trino. In the future, we
will optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
+From the test results, it can be seen that Doris' average query performance on
the standard static test set is 3-5 times that of Trino. In the future, we will
optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
## Query Optimization
For baseline data, after introducing the Primary Key Table Read Optimized
feature in Apache Paimon version 0.6, the query engine can directly access the
underlying Parquet/ORC files, significantly improving the reading efficiency of
baseline data. For unmerged incremental data (data increments generated by
INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In
addition, Paimon introduced the Deletion Vector feature in version 0.8, which
further enhances the query engine's [...]
-Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris's EXPLAIN statement.
+Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris' EXPLAIN statement.
```
mysql> explain verbose select * from customer where c_nationkey < 3;
diff --git a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md
index f7426e3a3fc..b640c5e8f92 100644
--- a/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md
+++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-hudi.md
@@ -31,7 +31,7 @@ In recent versions, Apache Doris has deepened its integration with data lakes an
- Since version 0.15, Apache Doris has introduced Hive and Iceberg external
tables, exploring the capabilities of combining with Apache Iceberg for data
lakes.
- Starting from version 1.2, Apache Doris officially introduced the
Multi-Catalog feature, enabling automatic metadata mapping and data access for
various data sources, along with numerous performance optimizations for
external data reading and query execution. It now fully possesses the ability
to build a high-speed and user-friendly Lakehouse architecture.
-- In version 2.1, Apache Doris's Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.
+- In version 2.1, Apache Doris' Data Lakehouse architecture was significantly
enhanced, improving the reading and writing capabilities of mainstream data
lake formats (Hudi, Iceberg, Paimon, etc.), introducing compatibility with
multiple SQL dialects, and seamless migration from existing systems to Apache
Doris. For data science and large-scale data reading scenarios, Doris
integrated the Arrow Flight high-speed reading interface, achieving a 100-fold
increase in data transfer efficiency.

@@ -46,12 +46,12 @@ Apache Doris has also enhanced its ability to read Apache Hudi data tables:
- Supports Time Travel
- Supports Incremental Read
-With Apache Doris's high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
+With Apache Doris' high-performance query execution and Apache Hudi's
real-time data management capabilities, efficient, flexible, and cost-effective
data querying and analysis can be achieved. It also provides robust data
lineage, auditing, and incremental processing functionalities. The combination
of Apache Doris and Apache Hudi has been validated and promoted in real
business scenarios by multiple community users:
- Real-time data analysis and processing: Common scenarios such as real-time
data updates and query analysis in industries like finance, advertising, and
e-commerce require real-time data processing. Hudi enables real-time data
updates and management while ensuring data consistency and reliability. Doris
efficiently handles large-scale data query requests in real-time, meeting the
demands of real-time data analysis and processing effectively when combined.
-- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris's efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
-- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris's Incremental Read
feature enhances this process, significantly improving data processing and
analysis efficiency.
-- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris's Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
+- Data lineage and auditing: For industries with high requirements for data
security and accuracy like finance and healthcare, data lineage and auditing
are crucial functionalities. Hudi offers Time Travel functionality for viewing
historical data states, combined with Apache Doris' efficient querying
capabilities, enabling quick analysis of data at any point in time for precise
lineage and auditing.
+- Incremental data reading and analysis: Large-scale data analysis often faces
challenges of large data volumes and frequent updates. Hudi supports
incremental data reading, allowing users to process only the changed data
without full data updates. Additionally, Apache Doris' Incremental Read feature
enhances this process, significantly improving data processing and analysis
efficiency.
+- Cross-data source federated queries: Many enterprises have complex data
sources stored in different databases. Doris' Multi-Catalog feature supports
automatic mapping and synchronization of various data sources, enabling
federated queries across data sources. This greatly shortens the data flow path
and enhances work efficiency for enterprises needing to retrieve and integrate
data from multiple sources for analysis.
This article will introduce readers to how to quickly set up a test and
demonstration environment for Apache Doris + Apache Hudi in a Docker
environment, and demonstrate various operations to help readers get started
quickly.
@@ -258,9 +258,9 @@ spark-sql> select * from customer_mor timestamp as of
'20240603015058442' where
Data in Apache Hudi can be roughly divided into two categories - baseline data
and incremental data. Baseline data is typically merged Parquet files, while
incremental data refers to data increments generated by INSERT, UPDATE, or
DELETE operations. Baseline data can be read directly, while incremental data
needs to be read through Merge on Read.
-For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris's native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
+For querying Hudi COW tables or Read Optimized queries on MOR tables, the data
belongs to baseline data and can be directly read using Doris' native Parquet
Reader, providing fast query responses. For incremental data, Doris needs to
access Hudi's Java SDK through JNI calls. To achieve optimal query performance,
Apache Doris divides the data in a query into baseline and incremental data
parts and reads them using the aforementioned methods.
-To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris's Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
+To verify this optimization approach, we can use the EXPLAIN statement to see
how many baseline and incremental data are present in a query example below.
For a COW table, all 101 data shards are baseline data
(`hudiNativeReadSplits=101/101`), so the COW table can be entirely read
directly using Doris' Parquet Reader, resulting in the best query performance.
For a ROW table, most data shards are baseline data
(`hudiNativeReadSplits=100/101`), with one shard being incremental data, which
[...]
```
-- COW table is read natively
diff --git
a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md
b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md
index cf1956c6e2d..27175dfe180 100644
--- a/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md
+++ b/versioned_docs/version-3.0/get-starting/quick-start/doris-paimon.md
@@ -228,12 +228,12 @@ We conducted a simple test on the TPCDS 1000 dataset in
Paimon (0.8) version, us

-From the test results, it can be seen that Doris's average query performance
on the standard static test set is 3-5 times that of Trino. In the future, we
will optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
+From the test results, it can be seen that Doris' average query performance on
the standard static test set is 3-5 times that of Trino. In the future, we will
optimize the Deletion Vector to further improve query efficiency in real
business scenarios.
## Query Optimization
For baseline data, after introducing the Primary Key Table Read Optimized
feature in Apache Paimon version 0.6, the query engine can directly access the
underlying Parquet/ORC files, significantly improving the reading efficiency of
baseline data. For unmerged incremental data (data increments generated by
INSERT, UPDATE, or DELETE), they can be read through Merge-on-Read. In
addition, Paimon introduced the Deletion Vector feature in version 0.8, which
further enhances the query engine's [...]
-Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris's EXPLAIN statement.
+Apache Doris supports reading Deletion Vector through native Reader and
performing Merge on Read. We demonstrate the query methods for baseline data
and incremental data in a query using Doris' EXPLAIN statement.
```
mysql> explain verbose select * from customer where c_nationkey < 3;