xuyangzhong commented on code in PR #27191:
URL: https://github.com/apache/flink/pull/27191#discussion_r2509251066


##########
docs/content/docs/dev/table/tuning.md:
##########
@@ -368,3 +368,52 @@ FROM TenantKafka t
          LEFT JOIN InventoryKafka i ON t.tenant_id = i.tenant_id AND ...;
 ```
 
+## Delta Joins
+
+In stream jobs, regular joins store all historical data from both sides of the 
input to ensure the accuracy of the computation results. When an input record 
is received, regular joins query the state of the other side to find matching 
records to output, while simultaneously updating its own state.
+However, as the job runs for a long term and the input data increases, the 
state of regular joins will gradually grow larger. This may lead to 
computational resources becoming a bottleneck, impacting the overall 
performance of the job and potentially causing a series of stability issues.
+
+To address this, we have introduced a new delta join operator. The core idea 
is to leverage a bidirectional lookup join approach to reuse data from source 
tables, replacing the state required by regular joins. Compared to traditional 
regular joins, delta joins significantly reduce the state size, improve the 
stability of the job, and also decrease the demand for computational resources.
+
+This feature is currently enabled by default and regular join will be 
optimized into delta join when the following conditions are simultaneously met:
+
+1. The sql pattern satisfies the optimization criteria. For details, please 
refer to [Supported Features and Limitations]({{< ref "docs/dev/table/tuning" 
>}}#supported-features-and-limitations)
+2. The external storage system of the source table provides index information 
for fast querying for delta joins. Currently, [Apache 
Fluss(Incubating)](https://fluss.apache.org/blog/fluss-open-source/) has 
provided index information at the table level for Flink, allowing such tables 
to be used as source tables for delta joins. Please refer to the [Fluss 
documentation](https://fluss.apache.org/docs/0.8/engine-flink/delta-joins/#flink-version-support)
 for more details.
+
+### Working Principle
+
+In Flink, regular joins require completely storing incoming data from both 
sides in the state, matching that data when the opposite side's data arrives. 
In contrast, delta joins utilize the indexing capabilities provided by external 
storage systems to convert the behavior of querying state data into efficient 
queries against data in the external storage system using index keys. This 
approach avoids the need for duplicate storage of the same data in both the 
external storage system and the state.

Review Comment:
   I have updated 
   ```
   In Flink, regular joins store all incoming records from both input sides in 
the state to ensure that future records can be matched correctly when 
corresponding data arrives from the opposite side.
   ```
   to 
   ```
   In Flink, regular joins store all incoming records from both input sides in 
the state to ensure that corresponding records can be matched correctly when 
data arrives from the opposite side.
   ``` 
   to remove the confusing `future records`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to