xuyangzhong commented on code in PR #27191:
URL: https://github.com/apache/flink/pull/27191#discussion_r2498260098
##########
docs/content/docs/dev/table/tuning.md:
##########
@@ -368,3 +368,55 @@ FROM TenantKafka t
LEFT JOIN InventoryKafka i ON t.tenant_id = i.tenant_id AND ...;
```
+## Delta Joins
+
+In streaming jobs, the large state of regular join has been a persistent
concern for users. Since streaming jobs are long-running, the state of regular
joins generally increases in size over time, leading to a series of issues,
including but not limited to:
+
+1. Excessive consumption of computing or storage resources, which becomes a
bottleneck for the job.
+2. Slow checkpointing, which affects job stability.
+3. Long recovery time from state during failures or restarts.
+
+Delta join primarily addresses these issues. The core idea of delta join is to
leverage a bidirectional lookup join approach to reuse data from source tables,
thereby replacing the state required by regular joins. Compared to traditional
regular joins, delta joins significantly reduce the state size, ensure job
stability, decrease computing resource requirements, and accelerate job
recovery times.
+
+This feature is currently enabled by default and regular join will be
optimized into delta join when the following conditions are simultaneously met:
+
+1. The sql pattern satisfies the optimization criteria. For details, please
refer to [Supported Features and Limitations]({{< ref "docs/dev/table/tuning"
>}}#supported-features-and-limitations)
+2. The external storage system of the source table provides index information
for fast querying for delta joins. Currently, [Apache
Fluss(Incubating)](https://fluss.apache.org/blog/fluss-open-source/) has
provided index information at the table level for Flink, allowing such tables
to be used as source tables for delta joins. Please refer to the [Fluss
documentation](https://fluss.apache.org/docs/0.8/engine-flink/delta-joins/#flink-version-support)
for more details.
+
+### Working Principle
+
+In Flink, regular joins require completely storing incoming data from both
sides in the state, matching that data when the opposite side's data arrives.
In contrast, delta joins utilize the indexing capabilities provided by external
storage systems to convert the behavior of querying state data into efficient
queries against data in the external storage system using index keys. This
approach avoids the need for duplicate storage of the same data in both the
external storage system and the state.
+
+{{< img src="/fig/table-streaming/delta_join.png" width="100%" height="100%"
>}}
Review Comment:
You're right.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]