This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 343ad697c33 [doc] (doris catalog) doris catalog supports a new cross
cluster query method (#3138)
343ad697c33 is described below
commit 343ad697c330de38e5a61eb3cb5789e06acdc521
Author: HonestManXin <[email protected]>
AuthorDate: Fri Dec 5 16:34:16 2025 +0800
[doc] (doris catalog) doris catalog supports a new cross cluster query
method (#3138)
https://github.com/apache/doris/pull/57898
## Versions
- [ x ] dev
- [ x ] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [ x ] Chinese
- [ ] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---------
Co-authored-by: zhangshixin.1024 <[email protected]>
Co-authored-by: Mingyu Chen (Rayner) <[email protected]>
---
docs/lakehouse/catalogs/doris-catalog.mdx | 153 ++++++++++++++++++---
.../backup-and-restore/CREATE-REPOSITORY.md | 3 +-
.../current/lakehouse/catalogs/doris-catalog.mdx | 131 ++++++++++++++++--
.../lakehouse/catalogs/doris-catalog.mdx | 131 ++++++++++++++++--
.../Lakehouse/doris-catalog/arrow-flight-mode.png | Bin 0 -> 98969 bytes
.../doris-catalog/virtual-cluster-mode.png | Bin 0 -> 65048 bytes
.../backup-and-restore/CREATE-REPOSITORY.md | 5 +-
.../backup-and-restore/CREATE-REPOSITORY.md | 3 +-
.../lakehouse/catalogs/doris-catalog.mdx | 153 ++++++++++++++++++---
.../backup-and-restore/CREATE-REPOSITORY.md | 3 +-
10 files changed, 507 insertions(+), 75 deletions(-)
diff --git a/docs/lakehouse/catalogs/doris-catalog.mdx
b/docs/lakehouse/catalogs/doris-catalog.mdx
index b4a4c023390..9658d1e71a4 100644
--- a/docs/lakehouse/catalogs/doris-catalog.mdx
+++ b/docs/lakehouse/catalogs/doris-catalog.mdx
@@ -8,10 +8,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-Doris Catalog allows users to access data across multiple Doris clusters
through HTTP protocol and Arrow Flight protocol.
-
-This document introduces how to configure remote Doris cluster connections and
perform queries.
-
:::note
This feature is supported since version 4.0.2.
@@ -22,7 +18,7 @@ This is an experimental feature.
| Scenario | Description |
|----------|-------------|
-| Federated Query | Doris enables associative queries across multiple
independent Doris clusters through predicate pushdown and Arrow Flight protocol
|
+| Federated Query | Join queries across multiple independent Doris clusters |
## Configuring Catalog
@@ -33,6 +29,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type' = 'doris', -- required
'fe_http_hosts' =
'http://<fe-host1>:<fe-http-port>,<fe-host2>:<fe-http-port>', -- required
'fe_arrow_hosts' =
'<fe-host1>:<fe-arrow-flight-port>,<fe-host2>:<fe-arrow-flight-port>', --
required
+ 'fe_thrift_hosts' =
'<fe-host1>:<fe-thrift-port>,<fe-host2>:<fe-thrift-port>', -- required
+ 'use_arrow_flight' = 'true/false', -- required
'user' = '', -- required
'password' = '', -- required
{QueryProperties},
@@ -49,15 +47,23 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
List of remote Doris cluster FE Arrow Flight service endpoints.
+* `fe_thrift_hosts`
+
+ List of remote Doris cluster FE Thrift service endpoints.
+
+* `use_arrow_flight`
+
+ Whether to access the remote Doris cluster using Arrow Flight or treat
remote tables as internal tables and send execution plans to the remote Doris
cluster for execution.
+
* `{QueryProperties}`
Optional properties
| Parameter Name | Description | Default Value |
|----------------|-------------|---------------|
- | `enable_parallel_result_sink` | When enabled, local Doris BE nodes will
pull data from remote Doris cluster BE nodes in parallel. | true |
- | `query_retry_count` | Maximum number of retries for failed query requests
to remote Doris. (Does not include failures that may occur during asynchronous
execution after the request is accepted) | 3 |
- | `query_timeout_sec` | Timeout for sending queries to remote Doris. (Does
not include asynchronous execution time after the request is accepted) | 15 |
+ | `enable_parallel_result_sink` | When enabled, local Doris BE nodes will
pull data in parallel from each BE node of the remote Doris cluster. (For Arrow
Flight mode) | true |
+ | `query_retry_count` | Maximum retry count for failed query requests to
remote Doris. (Does not include failures that may occur during asynchronous
execution after the request is accepted by remote Doris) | 3 |
+ | `query_timeout_sec` | Timeout for sending queries to remote Doris. (Does
not include asynchronous execution time after the request is accepted by remote
Doris) | 15 |
| `compatible` | Used to attempt compatibility with metadata formats when
accessing remote Doris with versions lower than the local cluster. No need to
enable when cluster versions are consistent. | false |
* `{HttpClientProperties}`
@@ -67,27 +73,53 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| Parameter Name | Description | Default Value |
|----------------|-------------|---------------|
| `metadata_http_ssl_enabled` | Whether to enable SSL/TLS encrypted
communication for HTTP metadata synchronization. | false |
- | `metadata_sync_retry_count` | Maximum retry count for failed HTTP requests
| 3 |
- | `metadata_max_idle_connections` | Maximum idle connections for HTTP
metadata synchronization client | 5 |
- | `metadata_keep_alive_duration_sec` | Keep-alive duration for HTTP metadata
synchronization client idle connections | 300 |
- | `metadata_connect_timeout_sec` | TCP connection timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_read_timeout_sec` | Socket read timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_write_timeout_sec` | Socket write timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_call_timeout_sec` | Total HTTP request timeout for HTTP metadata
synchronization client | 10 |
+ | `metadata_sync_retry_count` | Maximum retry count for HTTP request
failures | 3 |
+ | `metadata_max_idle_connections` | Maximum idle connections for HTTP
metadata sync client | 5 |
+ | `metadata_keep_alive_duration_sec` | Idle connection keep-alive duration
for HTTP metadata sync client | 300 |
+ | `metadata_connect_timeout_sec` | TCP connection timeout for HTTP metadata
sync client | 10 |
+ | `metadata_read_timeout_sec` | Socket read timeout for HTTP metadata sync
client | 10 |
+ | `metadata_write_timeout_sec` | Socket write timeout for HTTP metadata sync
client | 10 |
+ | `metadata_call_timeout_sec` | HTTP request total timeout for HTTP metadata
sync client | 10 |
* `{CommonProperties}`
CommonProperties section is used to fill in common properties. Please refer
to the [Common Properties] section in the Data Catalog Overview.
+## Access Modes
+
+### Arrow Flight Mode
+
+When the `use_arrow_flight` property is `true`, it operates in Arrow Flight
mode.
+
+
+
+In this mode, during cross-cluster queries, FEs synchronize schema and other
metadata through HTTP protocol, then local cluster BE nodes access the Remote
Doris cluster through Arrow Flight interface.
+
+**Advantages**: Minimal overhead on FE, execution plan only generates query
SQL to send to remote cluster
+
+**Disadvantages**: May not be able to utilize various optimization features of
Doris internal tables, such as aggregation pushdown, limited predicate
pushdown, etc.
+
+### Virtual Cluster Mode
+
+When the `use_arrow_flight` property is `false`, it operates in virtual
cluster mode.
+
+
+
+In this mode, during cross-cluster queries, Backend nodes in the Remote Doris
cluster are treated as virtual nodes for query planning.
+
+FEs synchronize schema and other metadata through HTTP protocol. BEs directly
transfer data through internal communication protocol.
+
+**Advantages**: Can basically utilize all optimization features of Doris
internal table queries. Query execution process is consistent with
single-cluster internal process.
+
+**Disadvantages**: For large remote tables, it will obtain all information of
remote tables (partition information, replica information). FE memory overhead
will increase, requiring FE memory expansion. When cluster versions are
inconsistent, such as higher version querying lower version, query failures may
occur.
+
## Column Type Mapping
-Doris external table types are exactly the same as local Doris types.
+Doris external table types are completely identical to local Doris types.
## Query Operations
-### Basic Queries
-
-After configuring the Catalog, you can query table data in the Catalog in the
following ways:
+After configuring the Catalog, you can query table data in the Catalog through
the following methods:
```sql
-- 1. switch to catalog, use database and query
@@ -103,10 +135,15 @@ SELECT * FROM doris_tbl LIMIT 10;
SELECT * FROM doris_ctl.doris_db.doris_tbl LIMIT 10;
```
-### Query Optimization
+## Query Optimization
-When Doris Catalog accesses data sources, Doris will try to push down
predicates or function conditions and concatenate them into the generated SQL.
You can view the generated SQL statement through EXPLAIN SQL.
-```
+### Arrow Flight Mode
+
+In this mode, Doris will try to push down predicate or function conditions and
concatenate them into the generated SQL.
+
+You can view the generated SQL statement through EXPLAIN SQL.
+
+```sql
...
| 0:VREMOTE_DORIS_SCAN_NODE(68)
|
| TABLE: test.test_time
|
@@ -115,3 +152,75 @@ When Doris Catalog accesses data sources, Doris will try
to push down predicates
...
```
+### Virtual Cluster Mode
+
+In this mode, the execution plan still shows `VOlapScanNode`.
+
+Various optimizations for internal table queries in Doris can continue to be
utilized, such as Join Runtime Filter.
+
+```sql
+MySQL [(none)]> explain select * from demo.inner_table a join
edoris.external.example_tbl_duplicate b on (a.log_type = b.log_type) where
error_code=2;
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| Explain String(Nereids Planner)
|
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| PLAN FRAGMENT 0
|
+| OUTPUT EXPRS:
|
+| log_type[#16]
|
+| reason[#17]
|
+| log_time[#18]
|
+| log_type[#19]
|
+| error_code[#20]
|
+| error_msg[#21]
|
+| op_id[#22]
|
+| op_time[#23]
|
+| PARTITION: HASH_PARTITIONED: log_type[#6]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| VRESULT SINK
|
+| MYSQL_PROTOCOL
|
+|
|
+| 3:VHASH JOIN(200)
|
+| | join op: INNER JOIN(BROADCAST)[]
|
+| | equal join conjunct: (log_type[#6] = log_type[#1])
|
+| | cardinality=3
|
+| | vec output tuple id: 3
|
+| | output tuple id: 3
|
+| | vIntermediate tuple ids: 2
|
+| | hash output slot ids: 0 1 2 3 4 5 6 7
|
+| | runtime filters: RF000[min_max] <- log_type[#1](1/1/1048576),
RF001[in_or_bloom] <- log_type[#1](1/1/1048576) |
+| | final projections: log_type[#8], reason[#9], log_time[#10],
log_type[#11], error_code[#12], error_msg[#13], op_id[#14], op_time[#15] |
+| | final project output tuple id: 3
|
+| | distribute expr lists: log_type[#6]
|
+| | distribute expr lists:
|
+| |
|
+| |----1:VEXCHANGE
|
+| | offset: 0
|
+| | distribute expr lists: log_type[#1]
|
+| |
|
+| 2:VOlapScanNode(187)
|
+| TABLE: demo.inner_table(inner_table), PREAGGREGATION: ON
|
+| partitions=1/1 (inner_table)
|
+| tablets=1/1, tabletList=1762832514491
|
+| cardinality=3, avgRowSize=901.6666, numNodes=1
|
+| pushAggOp=NONE
|
+| runtime filters: RF000[min_max] -> log_type[#6], RF001[in_or_bloom] ->
log_type[#6] |
+|
|
+| PLAN FRAGMENT 1
|
+|
|
+| PARTITION: HASH_PARTITIONED: log_type[#1]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| STREAM DATA SINK
|
+| EXCHANGE ID: 01
|
+| UNPARTITIONED
|
+|
|
+| 0:VOlapScanNode(188)
|
+| TABLE: external.example_tbl_duplicate(example_tbl_duplicate),
PREAGGREGATION: ON |
+| PREDICATES: (error_code[#2] = 2)
|
+| partitions=1/1 (example_tbl_duplicate)
|
+| tablets=1/1, tabletList=1762481736238
|
+| cardinality=1, avgRowSize=7425.0, numNodes=1
|
+| pushAggOp=NONE
+```
\ No newline at end of file
diff --git
a/docs/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
b/docs/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
index 5a09e76041b..b72e89d7988 100644
---
a/docs/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
+++
b/docs/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
@@ -77,8 +77,7 @@ PROPERTIES
"s3.endpoint" = "http://s3-REGION.amazonaws.com",
"s3.region" = "s3-REGION",
"s3.access_key" = "AWS_ACCESS_KEY",
- "s3.secret_key"="AWS_SECRET_KEY",
- "s3.region" = "REGION"
+ "s3.secret_key"="AWS_SECRET_KEY"
);
```
**Note: **
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/doris-catalog.mdx
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/doris-catalog.mdx
index c7c74d3ef6f..5fd0dce3ca1 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/doris-catalog.mdx
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/doris-catalog.mdx
@@ -8,10 +8,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-Doris Catalog 允许用户通过 HTTP 协议和 Arrow Flight 协议进行跨多个 Doris 集群的数据访问。
-
-本文档介绍如何配置远程 Doris 集群连接并进行查询。
-
:::note
该功能自 4.0.2 版本支持。
@@ -22,7 +18,7 @@ Doris Catalog 允许用户通过 HTTP 协议和 Arrow Flight 协议进行跨多
| 场景 | 说明
|
|---------|-----------------------------------------------------------------------|
-| 联邦查询 | Doris 通过谓词下推与 Arrow Flight 协议,实现对多个独立 Doris 集群的关联查询 |
+| 联邦查询 | 对多个独立 Doris 集群的关联查询 |
## 配置 Catalog
@@ -33,6 +29,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type' = 'doris', -- required
'fe_http_hosts' =
'http://<fe-host1>:<fe-http-port>,<fe-host2>:<fe-http-port>', -- required
'fe_arrow_hosts' =
'<fe-host1>:<fe-arrow-flight-port>,<fe-host2>:<fe-arrow-flight-port>', --
required
+ 'fe_thrift_hosts' =
'<fe-host1>:<fe-thrift-port>,<fe-host2>:<fe-thrift-port>', -- required
+ 'use_arrow_flight' = 'true/false', -- required
'user' = '', -- required
'password' = '', -- required
{QueryProperties},
@@ -49,13 +47,21 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
远端 Doris 集群 FE Arrow Flight 服务端点列表。
+* `fe_thrift_hosts`
+
+远端 Doris 集群 FE Thrift 服务端点列表。
+
+* `use_arrow_flight`
+
+采用 Arrow Flight 方式访问远端 Doris 集群还是将远端表当做内表执行计划发送给远端 Doris 集群执行
+
* `{QueryProperties}`
可选属性
| 参数名称 | 说明
| 默认值 |
|-----------------------------|------------------------------------------------------------------------------------------|-------|
- | `enable_parallel_result_sink` | 开启后,本地 Doris BE 节点将并行地从远端 Doris 集群各 BE
节点拉取数据。 | true |
+ | `enable_parallel_result_sink` | 开启后,本地 Doris BE 节点将并行地从远端 Doris 集群各 BE
节点拉取数据。(针对 Arrow Flight 方式) | true |
| `query_retry_count` | 向远端 Doris 发送查询请求失败的最大重试次数。(不包含请求被接受后,远端
Doris 异步执行过程中可能发生的失败) | 3 |
| `query_timeout_sec` | 向远端 Doris 发送查询的超时时间。(不包含请求被接受后,远端 Doris
异步执行时间) | 15 |
| `compatible` | 用于在访问版本低于本集群的远端 Doris
时,尝试兼容其元数据格式。集群版本一致时无需开启。 | false |
@@ -80,14 +86,40 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
CommonProperties 部分用于填写通用属性。请参阅 数据目录概述 中【通用属性】部分。
+## 访问模式
+
+### Arrow Flight 模式
+
+当 `use_arrow_flight` 属性为 `true` 时,则为 Arrow Flight 模式。
+
+
+
+在该模式下进行跨集群查询时,FE 之间通过 HTTP 协议同步 Schema 等元信息,然后本地集群的 BE 节点,通过 Arrow Flight 接口访问
Remote Doris 集群。
+
+**优点**:对于 FE 基本没开销,执行计划仅生成查询 SQL 发往远端集群
+
+**缺点**:可能无法利用 Doris 内表的各种优化特性,如聚合下推、有限的谓词下推等。
+
+### 虚拟集群模式
+
+当 `use_arrow_flight` 属性为 `false` 时,则为虚拟集群模式。
+
+
+
+在该模式下进行跨集群查询时,会将 Remote Doris 集群中的 Backend 节点当做虚拟节点进行查询规划。
+
+FE 之间通过 HTTP 协议同步 Schema 等元信息。BE 直接通过内部通信协议进行数据传输。
+
+**优点**:基本可以利用 Doris 内表查询的所有优化特性。查询执行流程和单集群内部流程一致。
+
+**缺点**:对于较大的远端表来说,会获取远端表的所有信息 (分区信息,副本信息)。FE 的内存开销会上升,需要扩大 FE
内存。在各集群版本不一致时,比如高版本查询低版本,可能会出现查询失败。
+
## 列类型映射
Doris 外表类型与本地 Doris 类型完全相同。
## 查询操作
-### 基础查询
-
配置好 Catalog 后,可以通过以下方式查询 Catalog 中的表数据:
```sql
@@ -104,10 +136,15 @@ SELECT * FROM doris_tbl LIMIT 10;
SELECT * FROM doris_ctl.doris_db.doris_tbl LIMIT 10;
```
-### 查询优化
+## 查询优化
-Doris Catalog 访问数据源时,Doris 会尽量将谓词或函数条件下推并拼接到生成的 SQL 中。可以通过 EXPLAIN SQL 查看到生成的
SQL 语句。
-```
+### Arrow Flight 模式
+
+该模式下,Doris 会尽量将谓词或函数条件下推并拼接到生成的 SQL 中。
+
+可以通过 EXPLAIN SQL 查看到生成的 SQL 语句。
+
+```sql
...
| 0:VREMOTE_DORIS_SCAN_NODE(68)
|
| TABLE: test.test_time
|
@@ -116,3 +153,75 @@ Doris Catalog 访问数据源时,Doris 会尽量将谓词或函数条件下推
...
```
+### 虚拟集群模式
+
+该模式下,执行计划看到的依然是 `VOlapScanNode`。
+
+Doris 针对内表查询的各种优化都可以继续利用,比如 Join Runtime Filter。
+
+```sql
+MySQL [(none)]> explain select * from demo.inner_table a join
edoris.external.example_tbl_duplicate b on (a.log_type = b.log_type) where
error_code=2;
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| Explain String(Nereids Planner)
|
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| PLAN FRAGMENT 0
|
+| OUTPUT EXPRS:
|
+| log_type[#16]
|
+| reason[#17]
|
+| log_time[#18]
|
+| log_type[#19]
|
+| error_code[#20]
|
+| error_msg[#21]
|
+| op_id[#22]
|
+| op_time[#23]
|
+| PARTITION: HASH_PARTITIONED: log_type[#6]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| VRESULT SINK
|
+| MYSQL_PROTOCOL
|
+|
|
+| 3:VHASH JOIN(200)
|
+| | join op: INNER JOIN(BROADCAST)[]
|
+| | equal join conjunct: (log_type[#6] = log_type[#1])
|
+| | cardinality=3
|
+| | vec output tuple id: 3
|
+| | output tuple id: 3
|
+| | vIntermediate tuple ids: 2
|
+| | hash output slot ids: 0 1 2 3 4 5 6 7
|
+| | runtime filters: RF000[min_max] <- log_type[#1](1/1/1048576),
RF001[in_or_bloom] <- log_type[#1](1/1/1048576) |
+| | final projections: log_type[#8], reason[#9], log_time[#10],
log_type[#11], error_code[#12], error_msg[#13], op_id[#14], op_time[#15] |
+| | final project output tuple id: 3
|
+| | distribute expr lists: log_type[#6]
|
+| | distribute expr lists:
|
+| |
|
+| |----1:VEXCHANGE
|
+| | offset: 0
|
+| | distribute expr lists: log_type[#1]
|
+| |
|
+| 2:VOlapScanNode(187)
|
+| TABLE: demo.inner_table(inner_table), PREAGGREGATION: ON
|
+| partitions=1/1 (inner_table)
|
+| tablets=1/1, tabletList=1762832514491
|
+| cardinality=3, avgRowSize=901.6666, numNodes=1
|
+| pushAggOp=NONE
|
+| runtime filters: RF000[min_max] -> log_type[#6], RF001[in_or_bloom] ->
log_type[#6] |
+|
|
+| PLAN FRAGMENT 1
|
+|
|
+| PARTITION: HASH_PARTITIONED: log_type[#1]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| STREAM DATA SINK
|
+| EXCHANGE ID: 01
|
+| UNPARTITIONED
|
+|
|
+| 0:VOlapScanNode(188)
|
+| TABLE: external.example_tbl_duplicate(example_tbl_duplicate),
PREAGGREGATION: ON |
+| PREDICATES: (error_code[#2] = 2)
|
+| partitions=1/1 (example_tbl_duplicate)
|
+| tablets=1/1, tabletList=1762481736238
|
+| cardinality=1, avgRowSize=7425.0, numNodes=1
|
+| pushAggOp=NONE
+```
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
index c7c74d3ef6f..5fd0dce3ca1 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
@@ -8,10 +8,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-Doris Catalog 允许用户通过 HTTP 协议和 Arrow Flight 协议进行跨多个 Doris 集群的数据访问。
-
-本文档介绍如何配置远程 Doris 集群连接并进行查询。
-
:::note
该功能自 4.0.2 版本支持。
@@ -22,7 +18,7 @@ Doris Catalog 允许用户通过 HTTP 协议和 Arrow Flight 协议进行跨多
| 场景 | 说明
|
|---------|-----------------------------------------------------------------------|
-| 联邦查询 | Doris 通过谓词下推与 Arrow Flight 协议,实现对多个独立 Doris 集群的关联查询 |
+| 联邦查询 | 对多个独立 Doris 集群的关联查询 |
## 配置 Catalog
@@ -33,6 +29,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type' = 'doris', -- required
'fe_http_hosts' =
'http://<fe-host1>:<fe-http-port>,<fe-host2>:<fe-http-port>', -- required
'fe_arrow_hosts' =
'<fe-host1>:<fe-arrow-flight-port>,<fe-host2>:<fe-arrow-flight-port>', --
required
+ 'fe_thrift_hosts' =
'<fe-host1>:<fe-thrift-port>,<fe-host2>:<fe-thrift-port>', -- required
+ 'use_arrow_flight' = 'true/false', -- required
'user' = '', -- required
'password' = '', -- required
{QueryProperties},
@@ -49,13 +47,21 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
远端 Doris 集群 FE Arrow Flight 服务端点列表。
+* `fe_thrift_hosts`
+
+远端 Doris 集群 FE Thrift 服务端点列表。
+
+* `use_arrow_flight`
+
+采用 Arrow Flight 方式访问远端 Doris 集群还是将远端表当做内表执行计划发送给远端 Doris 集群执行
+
* `{QueryProperties}`
可选属性
| 参数名称 | 说明
| 默认值 |
|-----------------------------|------------------------------------------------------------------------------------------|-------|
- | `enable_parallel_result_sink` | 开启后,本地 Doris BE 节点将并行地从远端 Doris 集群各 BE
节点拉取数据。 | true |
+ | `enable_parallel_result_sink` | 开启后,本地 Doris BE 节点将并行地从远端 Doris 集群各 BE
节点拉取数据。(针对 Arrow Flight 方式) | true |
| `query_retry_count` | 向远端 Doris 发送查询请求失败的最大重试次数。(不包含请求被接受后,远端
Doris 异步执行过程中可能发生的失败) | 3 |
| `query_timeout_sec` | 向远端 Doris 发送查询的超时时间。(不包含请求被接受后,远端 Doris
异步执行时间) | 15 |
| `compatible` | 用于在访问版本低于本集群的远端 Doris
时,尝试兼容其元数据格式。集群版本一致时无需开启。 | false |
@@ -80,14 +86,40 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
CommonProperties 部分用于填写通用属性。请参阅 数据目录概述 中【通用属性】部分。
+## 访问模式
+
+### Arrow Flight 模式
+
+当 `use_arrow_flight` 属性为 `true` 时,则为 Arrow Flight 模式。
+
+
+
+在该模式下进行跨集群查询时,FE 之间通过 HTTP 协议同步 Schema 等元信息,然后本地集群的 BE 节点,通过 Arrow Flight 接口访问
Remote Doris 集群。
+
+**优点**:对于 FE 基本没开销,执行计划仅生成查询 SQL 发往远端集群
+
+**缺点**:可能无法利用 Doris 内表的各种优化特性,如聚合下推、有限的谓词下推等。
+
+### 虚拟集群模式
+
+当 `use_arrow_flight` 属性为 `false` 时,则为虚拟集群模式。
+
+
+
+在该模式下进行跨集群查询时,会将 Remote Doris 集群中的 Backend 节点当做虚拟节点进行查询规划。
+
+FE 之间通过 HTTP 协议同步 Schema 等元信息。BE 直接通过内部通信协议进行数据传输。
+
+**优点**:基本可以利用 Doris 内表查询的所有优化特性。查询执行流程和单集群内部流程一致。
+
+**缺点**:对于较大的远端表来说,会获取远端表的所有信息 (分区信息,副本信息)。FE 的内存开销会上升,需要扩大 FE
内存。在各集群版本不一致时,比如高版本查询低版本,可能会出现查询失败。
+
## 列类型映射
Doris 外表类型与本地 Doris 类型完全相同。
## 查询操作
-### 基础查询
-
配置好 Catalog 后,可以通过以下方式查询 Catalog 中的表数据:
```sql
@@ -104,10 +136,15 @@ SELECT * FROM doris_tbl LIMIT 10;
SELECT * FROM doris_ctl.doris_db.doris_tbl LIMIT 10;
```
-### 查询优化
+## 查询优化
-Doris Catalog 访问数据源时,Doris 会尽量将谓词或函数条件下推并拼接到生成的 SQL 中。可以通过 EXPLAIN SQL 查看到生成的
SQL 语句。
-```
+### Arrow Flight 模式
+
+该模式下,Doris 会尽量将谓词或函数条件下推并拼接到生成的 SQL 中。
+
+可以通过 EXPLAIN SQL 查看到生成的 SQL 语句。
+
+```sql
...
| 0:VREMOTE_DORIS_SCAN_NODE(68)
|
| TABLE: test.test_time
|
@@ -116,3 +153,75 @@ Doris Catalog 访问数据源时,Doris 会尽量将谓词或函数条件下推
...
```
+### 虚拟集群模式
+
+该模式下,执行计划看到的依然是 `VOlapScanNode`。
+
+Doris 针对内表查询的各种优化都可以继续利用,比如 Join Runtime Filter。
+
+```sql
+MySQL [(none)]> explain select * from demo.inner_table a join
edoris.external.example_tbl_duplicate b on (a.log_type = b.log_type) where
error_code=2;
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| Explain String(Nereids Planner)
|
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| PLAN FRAGMENT 0
|
+| OUTPUT EXPRS:
|
+| log_type[#16]
|
+| reason[#17]
|
+| log_time[#18]
|
+| log_type[#19]
|
+| error_code[#20]
|
+| error_msg[#21]
|
+| op_id[#22]
|
+| op_time[#23]
|
+| PARTITION: HASH_PARTITIONED: log_type[#6]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| VRESULT SINK
|
+| MYSQL_PROTOCOL
|
+|
|
+| 3:VHASH JOIN(200)
|
+| | join op: INNER JOIN(BROADCAST)[]
|
+| | equal join conjunct: (log_type[#6] = log_type[#1])
|
+| | cardinality=3
|
+| | vec output tuple id: 3
|
+| | output tuple id: 3
|
+| | vIntermediate tuple ids: 2
|
+| | hash output slot ids: 0 1 2 3 4 5 6 7
|
+| | runtime filters: RF000[min_max] <- log_type[#1](1/1/1048576),
RF001[in_or_bloom] <- log_type[#1](1/1/1048576) |
+| | final projections: log_type[#8], reason[#9], log_time[#10],
log_type[#11], error_code[#12], error_msg[#13], op_id[#14], op_time[#15] |
+| | final project output tuple id: 3
|
+| | distribute expr lists: log_type[#6]
|
+| | distribute expr lists:
|
+| |
|
+| |----1:VEXCHANGE
|
+| | offset: 0
|
+| | distribute expr lists: log_type[#1]
|
+| |
|
+| 2:VOlapScanNode(187)
|
+| TABLE: demo.inner_table(inner_table), PREAGGREGATION: ON
|
+| partitions=1/1 (inner_table)
|
+| tablets=1/1, tabletList=1762832514491
|
+| cardinality=3, avgRowSize=901.6666, numNodes=1
|
+| pushAggOp=NONE
|
+| runtime filters: RF000[min_max] -> log_type[#6], RF001[in_or_bloom] ->
log_type[#6] |
+|
|
+| PLAN FRAGMENT 1
|
+|
|
+| PARTITION: HASH_PARTITIONED: log_type[#1]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| STREAM DATA SINK
|
+| EXCHANGE ID: 01
|
+| UNPARTITIONED
|
+|
|
+| 0:VOlapScanNode(188)
|
+| TABLE: external.example_tbl_duplicate(example_tbl_duplicate),
PREAGGREGATION: ON |
+| PREDICATES: (error_code[#2] = 2)
|
+| partitions=1/1 (example_tbl_duplicate)
|
+| tablets=1/1, tabletList=1762481736238
|
+| cardinality=1, avgRowSize=7425.0, numNodes=1
|
+| pushAggOp=NONE
+```
diff --git a/static/images/Lakehouse/doris-catalog/arrow-flight-mode.png
b/static/images/Lakehouse/doris-catalog/arrow-flight-mode.png
new file mode 100644
index 00000000000..d93cbc14f4f
Binary files /dev/null and
b/static/images/Lakehouse/doris-catalog/arrow-flight-mode.png differ
diff --git a/static/images/Lakehouse/doris-catalog/virtual-cluster-mode.png
b/static/images/Lakehouse/doris-catalog/virtual-cluster-mode.png
new file mode 100644
index 00000000000..88c7a6021f5
Binary files /dev/null and
b/static/images/Lakehouse/doris-catalog/virtual-cluster-mode.png differ
diff --git
a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
index 31076bf370c..bf3155985fa 100644
---
a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
+++
b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
@@ -73,8 +73,7 @@ PROPERTIES
"s3.endpoint" = "http://s3-REGION.amazonaws.com",
"s3.region" = "s3-REGION",
"s3.access_key" = "AWS_ACCESS_KEY",
- "s3.secret_key"="AWS_SECRET_KEY",
- "s3.region" = "REGION"
+ "s3.secret_key"="AWS_SECRET_KEY"
);
```
@@ -136,4 +135,4 @@ PROPERTIES
"s3.endpoint" = "http://cos.ap-beijing.myqcloud.com",
"s3.region" = "ap-beijing"
);
-```
\ No newline at end of file
+```
diff --git
a/versioned_docs/version-3.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
b/versioned_docs/version-3.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
index daef000cd7d..333c61f0ce3 100644
---
a/versioned_docs/version-3.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
+++
b/versioned_docs/version-3.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
@@ -76,8 +76,7 @@ PROPERTIES
"s3.endpoint" = "http://s3-REGION.amazonaws.com",
"s3.region" = "s3-REGION",
"s3.access_key" = "AWS_ACCESS_KEY",
- "s3.secret_key"="AWS_SECRET_KEY",
- "s3.region" = "REGION"
+ "s3.secret_key"="AWS_SECRET_KEY"
);
```
diff --git a/versioned_docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
b/versioned_docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
index b4a4c023390..9658d1e71a4 100644
--- a/versioned_docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
+++ b/versioned_docs/version-4.x/lakehouse/catalogs/doris-catalog.mdx
@@ -8,10 +8,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-Doris Catalog allows users to access data across multiple Doris clusters
through HTTP protocol and Arrow Flight protocol.
-
-This document introduces how to configure remote Doris cluster connections and
perform queries.
-
:::note
This feature is supported since version 4.0.2.
@@ -22,7 +18,7 @@ This is an experimental feature.
| Scenario | Description |
|----------|-------------|
-| Federated Query | Doris enables associative queries across multiple
independent Doris clusters through predicate pushdown and Arrow Flight protocol
|
+| Federated Query | Join queries across multiple independent Doris clusters |
## Configuring Catalog
@@ -33,6 +29,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type' = 'doris', -- required
'fe_http_hosts' =
'http://<fe-host1>:<fe-http-port>,<fe-host2>:<fe-http-port>', -- required
'fe_arrow_hosts' =
'<fe-host1>:<fe-arrow-flight-port>,<fe-host2>:<fe-arrow-flight-port>', --
required
+ 'fe_thrift_hosts' =
'<fe-host1>:<fe-thrift-port>,<fe-host2>:<fe-thrift-port>', -- required
+ 'use_arrow_flight' = 'true/false', -- required
'user' = '', -- required
'password' = '', -- required
{QueryProperties},
@@ -49,15 +47,23 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
List of remote Doris cluster FE Arrow Flight service endpoints.
+* `fe_thrift_hosts`
+
+ List of remote Doris cluster FE Thrift service endpoints.
+
+* `use_arrow_flight`
+
+ Whether to access the remote Doris cluster using Arrow Flight or treat
remote tables as internal tables and send execution plans to the remote Doris
cluster for execution.
+
* `{QueryProperties}`
Optional properties
| Parameter Name | Description | Default Value |
|----------------|-------------|---------------|
- | `enable_parallel_result_sink` | When enabled, local Doris BE nodes will
pull data from remote Doris cluster BE nodes in parallel. | true |
- | `query_retry_count` | Maximum number of retries for failed query requests
to remote Doris. (Does not include failures that may occur during asynchronous
execution after the request is accepted) | 3 |
- | `query_timeout_sec` | Timeout for sending queries to remote Doris. (Does
not include asynchronous execution time after the request is accepted) | 15 |
+ | `enable_parallel_result_sink` | When enabled, local Doris BE nodes will
pull data in parallel from each BE node of the remote Doris cluster. (For Arrow
Flight mode) | true |
+ | `query_retry_count` | Maximum retry count for failed query requests to
remote Doris. (Does not include failures that may occur during asynchronous
execution after the request is accepted by remote Doris) | 3 |
+ | `query_timeout_sec` | Timeout for sending queries to remote Doris. (Does
not include asynchronous execution time after the request is accepted by remote
Doris) | 15 |
| `compatible` | Used to attempt compatibility with metadata formats when
accessing remote Doris with versions lower than the local cluster. No need to
enable when cluster versions are consistent. | false |
* `{HttpClientProperties}`
@@ -67,27 +73,53 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| Parameter Name | Description | Default Value |
|----------------|-------------|---------------|
| `metadata_http_ssl_enabled` | Whether to enable SSL/TLS encrypted
communication for HTTP metadata synchronization. | false |
- | `metadata_sync_retry_count` | Maximum retry count for failed HTTP requests
| 3 |
- | `metadata_max_idle_connections` | Maximum idle connections for HTTP
metadata synchronization client | 5 |
- | `metadata_keep_alive_duration_sec` | Keep-alive duration for HTTP metadata
synchronization client idle connections | 300 |
- | `metadata_connect_timeout_sec` | TCP connection timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_read_timeout_sec` | Socket read timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_write_timeout_sec` | Socket write timeout for HTTP metadata
synchronization client | 10 |
- | `metadata_call_timeout_sec` | Total HTTP request timeout for HTTP metadata
synchronization client | 10 |
+ | `metadata_sync_retry_count` | Maximum retry count for HTTP request
failures | 3 |
+ | `metadata_max_idle_connections` | Maximum idle connections for HTTP
metadata sync client | 5 |
+ | `metadata_keep_alive_duration_sec` | Idle connection keep-alive duration
for HTTP metadata sync client | 300 |
+ | `metadata_connect_timeout_sec` | TCP connection timeout for HTTP metadata
sync client | 10 |
+ | `metadata_read_timeout_sec` | Socket read timeout for HTTP metadata sync
client | 10 |
+ | `metadata_write_timeout_sec` | Socket write timeout for HTTP metadata sync
client | 10 |
+ | `metadata_call_timeout_sec` | HTTP request total timeout for HTTP metadata
sync client | 10 |
* `{CommonProperties}`
CommonProperties section is used to fill in common properties. Please refer
to the [Common Properties] section in the Data Catalog Overview.
+## Access Modes
+
+### Arrow Flight Mode
+
+When the `use_arrow_flight` property is `true`, it operates in Arrow Flight
mode.
+
+
+
+In this mode, during cross-cluster queries, FEs synchronize schema and other
metadata through HTTP protocol, then local cluster BE nodes access the Remote
Doris cluster through Arrow Flight interface.
+
+**Advantages**: Minimal overhead on FE, execution plan only generates query
SQL to send to remote cluster
+
+**Disadvantages**: May not be able to utilize various optimization features of
Doris internal tables, such as aggregation pushdown, limited predicate
pushdown, etc.
+
+### Virtual Cluster Mode
+
+When the `use_arrow_flight` property is `false`, it operates in virtual
cluster mode.
+
+
+
+In this mode, during cross-cluster queries, Backend nodes in the Remote Doris
cluster are treated as virtual nodes for query planning.
+
+FEs synchronize schema and other metadata through HTTP protocol. BEs directly
transfer data through internal communication protocol.
+
+**Advantages**: Can basically utilize all optimization features of Doris
internal table queries. Query execution process is consistent with
single-cluster internal process.
+
+**Disadvantages**: For large remote tables, it will obtain all information of
remote tables (partition information, replica information). FE memory overhead
will increase, requiring FE memory expansion. When cluster versions are
inconsistent, such as higher version querying lower version, query failures may
occur.
+
## Column Type Mapping
-Doris external table types are exactly the same as local Doris types.
+Doris external table types are completely identical to local Doris types.
## Query Operations
-### Basic Queries
-
-After configuring the Catalog, you can query table data in the Catalog in the
following ways:
+After configuring the Catalog, you can query table data in the Catalog through
the following methods:
```sql
-- 1. switch to catalog, use database and query
@@ -103,10 +135,15 @@ SELECT * FROM doris_tbl LIMIT 10;
SELECT * FROM doris_ctl.doris_db.doris_tbl LIMIT 10;
```
-### Query Optimization
+## Query Optimization
-When Doris Catalog accesses data sources, Doris will try to push down
predicates or function conditions and concatenate them into the generated SQL.
You can view the generated SQL statement through EXPLAIN SQL.
-```
+### Arrow Flight Mode
+
+In this mode, Doris will try to push down predicate or function conditions and
concatenate them into the generated SQL.
+
+You can view the generated SQL statement through EXPLAIN SQL.
+
+```sql
...
| 0:VREMOTE_DORIS_SCAN_NODE(68)
|
| TABLE: test.test_time
|
@@ -115,3 +152,75 @@ When Doris Catalog accesses data sources, Doris will try
to push down predicates
...
```
+### Virtual Cluster Mode
+
+In this mode, the execution plan still shows `VOlapScanNode`.
+
+Various optimizations for internal table queries in Doris can continue to be
utilized, such as Join Runtime Filter.
+
+```sql
+MySQL [(none)]> explain select * from demo.inner_table a join
edoris.external.example_tbl_duplicate b on (a.log_type = b.log_type) where
error_code=2;
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| Explain String(Nereids Planner)
|
++-------------------------------------------------------------------------------------------------------------------------------------------+
+| PLAN FRAGMENT 0
|
+| OUTPUT EXPRS:
|
+| log_type[#16]
|
+| reason[#17]
|
+| log_time[#18]
|
+| log_type[#19]
|
+| error_code[#20]
|
+| error_msg[#21]
|
+| op_id[#22]
|
+| op_time[#23]
|
+| PARTITION: HASH_PARTITIONED: log_type[#6]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| VRESULT SINK
|
+| MYSQL_PROTOCOL
|
+|
|
+| 3:VHASH JOIN(200)
|
+| | join op: INNER JOIN(BROADCAST)[]
|
+| | equal join conjunct: (log_type[#6] = log_type[#1])
|
+| | cardinality=3
|
+| | vec output tuple id: 3
|
+| | output tuple id: 3
|
+| | vIntermediate tuple ids: 2
|
+| | hash output slot ids: 0 1 2 3 4 5 6 7
|
+| | runtime filters: RF000[min_max] <- log_type[#1](1/1/1048576),
RF001[in_or_bloom] <- log_type[#1](1/1/1048576) |
+| | final projections: log_type[#8], reason[#9], log_time[#10],
log_type[#11], error_code[#12], error_msg[#13], op_id[#14], op_time[#15] |
+| | final project output tuple id: 3
|
+| | distribute expr lists: log_type[#6]
|
+| | distribute expr lists:
|
+| |
|
+| |----1:VEXCHANGE
|
+| | offset: 0
|
+| | distribute expr lists: log_type[#1]
|
+| |
|
+| 2:VOlapScanNode(187)
|
+| TABLE: demo.inner_table(inner_table), PREAGGREGATION: ON
|
+| partitions=1/1 (inner_table)
|
+| tablets=1/1, tabletList=1762832514491
|
+| cardinality=3, avgRowSize=901.6666, numNodes=1
|
+| pushAggOp=NONE
|
+| runtime filters: RF000[min_max] -> log_type[#6], RF001[in_or_bloom] ->
log_type[#6] |
+|
|
+| PLAN FRAGMENT 1
|
+|
|
+| PARTITION: HASH_PARTITIONED: log_type[#1]
|
+|
|
+| HAS_COLO_PLAN_NODE: false
|
+|
|
+| STREAM DATA SINK
|
+| EXCHANGE ID: 01
|
+| UNPARTITIONED
|
+|
|
+| 0:VOlapScanNode(188)
|
+| TABLE: external.example_tbl_duplicate(example_tbl_duplicate),
PREAGGREGATION: ON |
+| PREDICATES: (error_code[#2] = 2)
|
+| partitions=1/1 (example_tbl_duplicate)
|
+| tablets=1/1, tabletList=1762481736238
|
+| cardinality=1, avgRowSize=7425.0, numNodes=1
|
+| pushAggOp=NONE
+```
\ No newline at end of file
diff --git
a/versioned_docs/version-4.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
b/versioned_docs/version-4.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
index 5a09e76041b..b72e89d7988 100644
---
a/versioned_docs/version-4.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
+++
b/versioned_docs/version-4.x/sql-manual/sql-statements/data-modification/backup-and-restore/CREATE-REPOSITORY.md
@@ -77,8 +77,7 @@ PROPERTIES
"s3.endpoint" = "http://s3-REGION.amazonaws.com",
"s3.region" = "s3-REGION",
"s3.access_key" = "AWS_ACCESS_KEY",
- "s3.secret_key"="AWS_SECRET_KEY",
- "s3.region" = "REGION"
+ "s3.secret_key"="AWS_SECRET_KEY"
);
```
**Note: **
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]