Re: [PR] [FLINK-34256][doc] Add a documentation section for minibatch join [flink]

via GitHub Sun, 04 Feb 2024 20:02:04 -0800


xishuaidelin commented on code in PR #24240:
URL: https://github.com/apache/flink/pull/24240#discussion_r1477643284



##########
docs/content/docs/dev/table/tuning.md:
##########
@@ -266,5 +266,23 @@ GROUP BY day
 Flink SQL optimizer can recognize the different filter arguments on the same distinct key. For example, in the above example, all the three COUNT DISTINCT are on `user_id` column.
 Then Flink can use just one shared state instance instead of three state instances to reduce state access and state size. In some workloads, this can get significant performance improvements.
 
+## MiniBatch Join
+
+By default, the regular join operator processes input records one by one: (1) it looks up records in state by join key, (2) it writes or retracts the input in state, and (3) it processes the input and the joined records. This processing pattern can increase the overhead of the StateBackend (especially the RocksDB StateBackend).
+
+The core idea of mini-batch join is to cache a bundle of input records in a buffer inside the mini-batch join operator. Records are reduced while they sit in the buffer, and when the buffer is triggered for processing, the operator applies optimizations specific to the scenario. Some input records are folded according to the rule illustrated below:

Review Comment:
   A new graph has been added to clarify the principle.
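
For readers without the graph at hand, the buffering idea in the section above can be approximated with a small sketch. This is not Flink's operator code: the class `MiniBatchJoinBuffer`, its `max_size` trigger, and the folding rule used here (a `-D` retraction cancels an earlier buffered `+I` of the same row under the same join key) are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical changelog kinds for this sketch: "+I" inserts a row, "-D" retracts it.
INSERT, DELETE = "+I", "-D"


class MiniBatchJoinBuffer:
    """Toy buffer that folds records per join key before any state access."""

    def __init__(self, max_size):
        self.max_size = max_size         # flush once this many records have arrived
        self.seen = 0                    # records received since the last flush
        self.bundle = defaultdict(list)  # join key -> pending (kind, row) records

    def add(self, key, kind, row):
        """Buffer one record; return the folded bundle when the batch is full."""
        pending = self.bundle[key]
        # Assumed folding rule: a retraction cancels an earlier buffered insert
        # of the same row, so neither record needs to reach state.
        if kind == DELETE and (INSERT, row) in pending:
            pending.remove((INSERT, row))
        else:
            pending.append((kind, row))
        self.seen += 1
        return self.flush() if self.seen >= self.max_size else None

    def flush(self):
        """Return the surviving records per key and reset the buffer."""
        out = {k: v for k, v in self.bundle.items() if v}
        self.bundle.clear()
        self.seen = 0
        return out


# Usage: three records arrive, but only one survives folding.
buf = MiniBatchJoinBuffer(max_size=3)
buf.add("k1", INSERT, ("k1", "a"))
buf.add("k1", DELETE, ("k1", "a"))
result = buf.add("k2", INSERT, ("k2", "b"))
```

In this toy run the insert and retraction for `k1` cancel inside the buffer, so only the `k2` record would go on to the state lookup and join step, which is the kind of saving the section describes.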



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

