This is an automated email from the ASF dual-hosted git repository.
guoweijie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
The following commit(s) were added to refs/heads/master by this push:
new 3cfc211e152 [FLINK-35652][doc] Add document for lookup custom shuffle
3cfc211e152 is described below
commit 3cfc211e152a3b86ec5b43e06d3343168afa0f43
Author: yunfengzhou-hub <[email protected]>
AuthorDate: Fri Apr 18 11:31:28 2025 +0800
[FLINK-35652][doc] Add document for lookup custom shuffle
---
.../content.zh/docs/dev/table/sql/queries/hints.md | 21 ++++++++++++++++++
docs/content/docs/dev/table/sql/queries/hints.md | 25 ++++++++++++++++++++++
2 files changed, 46 insertions(+)
diff --git a/docs/content.zh/docs/dev/table/sql/queries/hints.md b/docs/content.zh/docs/dev/table/sql/queries/hints.md
index d413b1a459e..a1294dd7cdc 100644
--- a/docs/content.zh/docs/dev/table/sql/queries/hints.md
+++ b/docs/content.zh/docs/dev/table/sql/queries/hints.md
@@ -359,6 +359,14 @@ LOOKUP 联接提示允许用户建议 Flink 优化器:
<td>N/A</td>
<td>固定延迟策略的最大重试次数</td>
</tr>
+<tr>
+ <td>shuffle</td>
+ <td>shuffle</td>
+ <td>N</td>
+ <td>boolean</td>
+ <td>false</td>
+ <td>是否开启自定义数据分发功能。此功能允许 Lookup Source 自行决定数据分布方式并依此对数据查询逻辑做相应优化</td>
+</tr>
</tbody>
</table>
@@ -445,6 +453,19 @@ LOOKUP('table'='Customers', 'async'='false', 'retry-predicate'='lookup_miss', 'r
LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
```
+#### 4. 启用自定义数据分布
+
+在默认情况下,Lookup Join 的输入流数据分布是随机的,因此数据源可能无法有效利用缓存来加速查找。 用户可以通过如下方式启
+用自定义数据分发,使数据源能够自行决定输入数据的分布,并利用这一先验知识来优化其缓存和查找策略。
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+为了充分利用这个优化,目标 Lookup Source 应该提供对自定义数据分发能力的支持。连接器开发人员可以通过让
+LookupTableSource 子类实现 SupportsLookupCustomShuffle 接口来支持这种能力。即使 Source 尚未提供这种能力,用户
+依然可以选择先启用这个功能,此时 Flink 将会尝试应用哈希分区的优化方式以尽可能带来性能提升。
+
#### 进一步说明
#### 开启缓存对重试的影响
diff --git a/docs/content/docs/dev/table/sql/queries/hints.md b/docs/content/docs/dev/table/sql/queries/hints.md
index 2276269f440..e60d6dbc6ea 100644
--- a/docs/content/docs/dev/table/sql/queries/hints.md
+++ b/docs/content/docs/dev/table/sql/queries/hints.md
@@ -369,6 +369,14 @@ The LOOKUP hint allows users to suggest the Flink optimizer to:
<td>N/A</td>
<td>max attempt number of the 'fixed_delay' strategy</td>
</tr>
+<tr>
+ <td>shuffle</td>
+ <td>shuffle</td>
+ <td>N</td>
+ <td>boolean</td>
+ <td>false</td>
+ <td>whether to enable custom lookup shuffle, which allows the lookup source to decide input data distribution and to optimize lookup strategy accordingly</td>
+</tr>
</tbody>
</table>
@@ -464,6 +472,23 @@ If the lookup source only has one capability, then the 'async' mode option can b
LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
```
+#### 4. Enable Custom Data Distribution
+
+By default, the data distribution of Lookup Join's input stream is arbitrary, so sources may not
+make effective use of caches to accelerate lookups. By enabling custom shuffle as follows, the
+sources would be able to decide the distribution of the input data on their own and use this prior
+knowledge to optimize their caches and lookup strategy.
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+In order to make full use of this feature, the target lookup source should support custom
+shuffle. Connector developers can achieve this by having the `LookupTableSource` subclass
+implement `SupportsLookupCustomShuffle`. Even if the source does not provide such support yet, users
+can still enable this feature, in which case Flink will make a best effort to apply hash partitioning,
+which should also improve performance.
+
#### Further Notes
#### Effect Of Enabling Caching On Retries
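
For context, the `LOOKUP` fragment added in the hunk above is a query hint, so in practice it sits inside a hint comment of a lookup join. The following is a minimal sketch of such a query, not text from the commit: the `Orders` table, its `proc_time` processing-time attribute, and all column names are assumptions chosen for illustration, and only `Customers` and the `'shuffle'='true'` option come from the documented example.

```sql
-- Sketch only: Orders, proc_time, and the column names are hypothetical.
-- The hint asks the planner to enable custom shuffle for the lookup on Customers.
SELECT /*+ LOOKUP('table'='Customers', 'shuffle'='true') */
    o.order_id,
    o.total,
    c.country
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;
```

As the documentation text above describes, if the `Customers` connector does not implement `SupportsLookupCustomShuffle`, the hint is still accepted and Flink falls back to trying a hash partitioning of the input.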