(flink) branch master updated: [FLINK-35652][doc] Add document for lookup custom shuffle

guoweijie Thu, 17 Apr 2025 20:31:39 -0700

This is an automated email from the ASF dual-hosted git repository.

guoweijie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git



The following commit(s) were added to refs/heads/master by this push:
     new 3cfc211e152 [FLINK-35652][doc] Add document for lookup custom shuffle
3cfc211e152 is described below

commit 3cfc211e152a3b86ec5b43e06d3343168afa0f43
Author: yunfengzhou-hub <[email protected]>
AuthorDate: Fri Apr 18 11:31:28 2025 +0800

    [FLINK-35652][doc] Add document for lookup custom shuffle
---
 .../content.zh/docs/dev/table/sql/queries/hints.md | 21 ++++++++++++++++++
 docs/content/docs/dev/table/sql/queries/hints.md   | 25 ++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/docs/content.zh/docs/dev/table/sql/queries/hints.md b/docs/content.zh/docs/dev/table/sql/queries/hints.md
index d413b1a459e..a1294dd7cdc 100644
--- a/docs/content.zh/docs/dev/table/sql/queries/hints.md
+++ b/docs/content.zh/docs/dev/table/sql/queries/hints.md
@@ -359,6 +359,14 @@ The LOOKUP join hint allows users to suggest the Flink optimizer to:
        <td>N/A</td>
        <td>maximum number of retries for the fixed-delay strategy</td>
 </tr>
+<tr>
+       <td>shuffle</td>
+       <td>shuffle</td>
+       <td>N</td>
+       <td>boolean</td>
+       <td>false</td>
+       <td>whether to enable custom data distribution, which lets the Lookup Source decide how data is distributed and optimize its lookup logic accordingly</td>
+</tr>
 </tbody>
 </table>
 
@@ -445,6 +453,19 @@ LOOKUP('table'='Customers', 'async'='false', 'retry-predicate'='lookup_miss', 'r
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
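+#### 4. Enable Custom Data Distribution
+
+By default, the input stream of a Lookup Join is distributed randomly, so the source may not be able to use its cache effectively to speed up lookups. Users can enable custom data distribution as shown below,
+which lets the source decide how the input data is distributed and use that prior knowledge to optimize its caching and lookup strategy.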
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
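+To take full advantage of this optimization, the target Lookup Source should support custom data distribution. Connector developers can provide this by having their
+LookupTableSource subclass implement the SupportsLookupCustomShuffle interface. Even if the Source does not yet provide this capability, users
+can still enable the feature, in which case Flink will try to apply hash partitioning to improve performance where possible.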
+
 #### Further Notes
 
 #### Effect Of Enabling Caching On Retries
diff --git a/docs/content/docs/dev/table/sql/queries/hints.md b/docs/content/docs/dev/table/sql/queries/hints.md
index 2276269f440..e60d6dbc6ea 100644
--- a/docs/content/docs/dev/table/sql/queries/hints.md
+++ b/docs/content/docs/dev/table/sql/queries/hints.md
@@ -369,6 +369,14 @@ The LOOKUP hint allows users to suggest the Flink optimizer to:
        <td>N/A</td>
        <td>max attempt number of the 'fixed_delay' strategy</td>
 </tr>
+<tr>
+       <td>shuffle</td>
+       <td>shuffle</td>
+       <td>N</td>
+       <td>boolean</td>
+       <td>false</td>
+       <td>whether to enable custom lookup shuffle, which allows the lookup source to decide input data distribution and to optimize lookup strategy accordingly</td>
+</tr>
 </tbody>
 </table>
 
@@ -464,6 +472,23 @@ If the lookup source only has one capability, then the 'async' mode option can b
 LOOKUP('table'='Customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')
 ```
 
+#### 4. Enable Custom Data Distribution
+
+By default, the data distribution of Lookup Join's input stream is arbitrary, so sources may not
+make effective use of caches to accelerate lookups. By enabling custom shuffle as follows, the
+sources would be able to decide the distribution of the input data on their own and use this prior
+knowledge to optimize their caches and lookup strategy.
+
+```sql
+LOOKUP('table'='Customers', 'shuffle'='true')
+```
+
+In order to make full use of this feature, the target lookup source should support custom
+shuffle. For connector developers, this can be achieved by having the `LookupTableSource` subclass
+implement `SupportsLookupCustomShuffle`. Even if the source does not provide such support yet, users
+can still enable this feature, in which case Flink will make a best effort to apply hash partitioning,
+which should also improve performance.
+
 #### Further Notes
 
 #### Effect Of Enabling Caching On Retries
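
For readers of this notification, the hint added by the commit appears in the diff only as a bare `LOOKUP(...)` fragment. A minimal sketch of how it might sit inside a complete lookup join query follows. The `Orders` table, its `proc_time` processing-time attribute, and all column names are illustrative assumptions and are not part of this commit; only the `'table'` and `'shuffle'` hint options come from the documented change.

```sql
-- Sketch only: Orders, proc_time, and the selected columns are hypothetical.
-- The hint is placed in a comment-style block right after SELECT.
SELECT /*+ LOOKUP('table'='Customers', 'shuffle'='true') */
    o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;
```

As the added documentation describes, if the `Customers` connector does not implement `SupportsLookupCustomShuffle`, Flink is expected to try hash partitioning on the join input instead.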
