Re: [PR] [FLINK-34256][doc] Add a documentation section for minibatch join [flink]

via GitHub Sun, 04 Feb 2024 20:01:29 -0800


xishuaidelin commented on code in PR #24240:
URL: https://github.com/apache/flink/pull/24240#discussion_r1477643118



##########
docs/content/docs/dev/table/tuning.md:
##########
@@ -266,5 +266,23 @@ GROUP BY day
 Flink SQL optimizer can recognize the different filter arguments on the same 
distinct key. For example, in the above example, all the three COUNT DISTINCT 
are on `user_id` column.
 Then Flink can use just one shared state instance instead of three state 
instances to reduce state access and state size. In some workloads, this can 
get significant performance improvements.
 
+## MiniBatch Join
+
+By default, the regular join operator processes input records one by one: it 
(1) looks up records in state according to the join key, (2) writes or retracts 
the input in state, and (3) processes the input and the joined records. This 
per-record pattern may increase the overhead of the StateBackend (especially 
the RocksDB StateBackend).
+
+The core idea of mini-batch join is to cache a bundle of inputs in a buffer 
inside the mini-batch join operator. Records are reduced while they sit in the 
buffer, and when the bundle is triggered for processing, further optimizations 
are applied depending on the scenario. Some input records are folded according 
to the rules illustrated below:
+
+{{< img src="/fig/table-streaming/folded.png" width="70%" height="70%" >}}
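The folding step can be sketched in a few lines of Python. This is only an illustration, not Flink's implementation: the exact rules are given in the figure above, and the sketch assumes a simplified rule in which an accumulate message (`+I` or `+U`) followed by a retract message (`-U` or `-D`) for the same row and join key cancel each other. The function name `fold_bundle` and the tuple layout are hypothetical.

```python
from collections import defaultdict

# Assumed message kinds (changelog row kinds); the grouping is a simplification.
ACCUMULATE = {"+I", "+U"}
RETRACT = {"-U", "-D"}


def fold_bundle(records):
    """Fold a buffered bundle of (kind, join_key, row) tuples.

    Hypothetical rule: a retract message cancels the most recent unmatched
    accumulate message with the same join key and row. Everything else is
    kept in arrival order.
    """
    pending = defaultdict(list)  # (join_key, row) -> indices of unmatched accumulates
    keep = [True] * len(records)
    for i, (kind, key, row) in enumerate(records):
        slot = pending[(key, row)]
        if kind in RETRACT and slot:
            # Drop both the earlier accumulate and this retract.
            keep[slot.pop()] = False
            keep[i] = False
        elif kind in ACCUMULATE:
            slot.append(i)
    return [record for record, kept in zip(records, keep) if kept]


bundle = [("+I", 1, "a"), ("-D", 1, "a"), ("+I", 1, "b"), ("-U", 2, "c")]
remaining = fold_bundle(bundle)
```

In this sketch, only the records left in `remaining` would reach the state lookup and write steps, which is where the saving in state access would come from.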
+
+When the bundle is triggered for processing, it contains only records that 
could not be folded further. A second optimization is then applied to update 
records: when a pair of -U and +U records is encountered, the pair is 
recognized and the redundant records it would produce in the output are 
suppressed. The graph below illustrates the principle.
+
+{{< img src="/fig/table-streaming/suppress.jpg" width="70%" height="70%" >}}
+
+In addition, the order in which the left and right stream bundles are 
processed can also help reduce redundant records, but this applies only to 
outer joins. The following graph illustrates the principle:
+
+{{< img src="/fig/table-streaming/order.jpg" width="70%" height="70%" >}}
+
+MiniBatch optimization is disabled by default for regular join. To enable it, 
set the options `table.exec.mini-batch.enabled`, 
`table.exec.mini-batch.allow-latency` and `table.exec.mini-batch.size`. Please 
see the [configuration]({{< ref 
"docs/dev/table/config" >}}#execution-options) page for more details.
+For examples, refer to the MiniBatch Aggregation section.
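As a rough illustration of the three options named above, the following Python sketch collects them in a dictionary and renders them as SQL `SET` statements. The values (`true`, `5 s`, `5000`) are illustrative assumptions rather than recommendations from this PR, and the helper `render_set_statements` is hypothetical.

```python
# Option keys come from the documentation text; the values are assumed examples.
MINI_BATCH_OPTIONS = {
    "table.exec.mini-batch.enabled": "true",       # turn the optimization on
    "table.exec.mini-batch.allow-latency": "5 s",  # assumed buffering interval
    "table.exec.mini-batch.size": "5000",          # assumed max records per bundle
}


def render_set_statements(options):
    """Render each option as a SQL `SET 'key' = 'value';` statement."""
    return [f"SET '{key}' = '{value}';" for key, value in options.items()]


statements = render_set_statements(MINI_BATCH_OPTIONS)
for statement in statements:
    print(statement)
```

The same key/value pairs could equally be passed to the table configuration in a Table API program; the configuration page linked above describes each option.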

Review Comment:
   Good idea. Done



