qidaye commented on a change in pull request #5459:
URL: https://github.com/apache/incubator-doris/pull/5459#discussion_r587297331
##########
File path: docs/en/administrator-guide/bucket-shuffle-join.md
##########
@@ -0,0 +1,106 @@
+```
+---
+{
+ "title": "Bucket Shuffle Join",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+```
+# Bucket Shuffle Join
+
+Bucket Shuffle Join is a new function officially added in Doris 0.14. The
purpose is to provide local optimization for some join queries to reduce the
time-consuming of data transmission between nodes and speed up the query.
+
+It's design, implementation can be referred to [ISSUE
4394](https://github.com/apache/incubator-doris/issues/4394)。
+
+## Noun Interpretation
+
+* FE: Frontend, the front-end node of Doris. Responsible for metadata
management and request access.
+* BE: Backend, Doris's back-end node. Responsible for query execution and data
storage.
+* Left table: the left table in join query. Perform probe expr. The order can
be adjusted by join reorder.
+* Right table: the right table in join query. Perform build expr The order can
be adjusted by join reorder.
+
+## Principle
+In addition to bucket shuffle join, Doris supports three types of join:
`Shuffle Join, Broadcast Join, Colocate Join`. Except `colorate join`, other
types of join will lead to some network overhead.
+
+For example, there are join queries for table A and table B. the join method
is hashjoin. The cost of different join types is as follows:
+* **Broadcast Join**: If table a has three executing hashjoinnodes according
to the data distribution, table B needs to be sent to the three HashJoinNode.
Its network overhead is `3B `, and its memory overhead is `3B`.
+* **Shuffle Join**: Shuffle join will distribute the data of tables A and B to
the nodes of the cluster according to hash calculation, so its network overhead
is `A + B` and memory overhead is `B`.
+
+The data distribution information of each Doris table is saved in FE. If the
join statement hits the data distribution column of the left table, we should
use the data distribution information to reduce the network and memory overhead
of the join query. This is the source of the idea of bucket shuffle join.
+
+
Review comment:
This image should be included in the Doris repository and should not
rely on a third-party link.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]