[GitHub] [flink-table-store] yuzelin commented on a diff in pull request #563: [FLINK-31252] Improve StaticFileStoreSplitEnumerator to assign batch splits

via GitHub Wed, 01 Mar 2023 22:57:40 -0800


yuzelin commented on code in PR #563:
URL: https://github.com/apache/flink-table-store/pull/563#discussion_r1122663315



##########
flink-table-store-flink/flink-table-store-flink-common/src/main/java/org/apache/flink/table/store/connector/source/StaticFileStoreSplitEnumerator.java:
##########
@@ -61,17 +75,27 @@ public void handleSplitRequest(int subtask, @Nullable String hostname) {
             return;
         }
 
-        FileStoreSourceSplit split = splits.poll();
-        if (split != null) {
-            context.assignSplit(split, subtask);
+        // The following batch assignment operation is for two things:
+        // 1. It can be evenly distributed during batch reading to avoid scheduling problems (for
+        // example, the current resource can only schedule part of the tasks) that cause some tasks
+        // to fail to read data.
+        // 2. Read with limit, if split is assigned one by one, it may cause the task to repeatedly
+        // create SplitFetchers. After the task is created, it is found that it is idle and then
+        // closed. Then, new split coming, it will create SplitFetcher and repeatedly read the data
+        // of the limit number (the limit status is in the SplitFetcher).
+        List<FileStoreSourceSplit> splits = pendingSplitAssignment.remove(subtask);
+        if (splits != null && splits.size() > 0) {
+            context.assignSplits(new SplitsAssignment<>(Collections.singletonMap(subtask, splits)));

Review Comment:
   Comment:
   The following batch assignment serves two purposes:
   1. Distribute splits evenly during batch reading, so that scheduling
   problems (for example, the current resources can only schedule part of the
   tasks) do not leave some tasks without data to read.
   2. Optimize reading with a limit. When reading with a limit, the task
   creates a new SplitFetcher for each incoming split and reads up to the
   limit number of records again (the limit state is kept in the
   SplitFetcher). So if splits are handed out in pieces that are too small,
   the task spends more time creating SplitFetchers and reading data.
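
   To illustrate the two purposes above, here is a minimal, self-contained sketch of batch assignment. It is not the code in this PR: `BatchSplitAssigner` is a hypothetical stand-in for the enumerator, splits are plain strings, and the round-robin distribution is an assumption made for the example. Only the `pendingSplitAssignment.remove(subtask)` pattern comes from the diff.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the enumerator; not the actual Flink Table Store class. */
public class BatchSplitAssigner {
    // subtask index -> splits reserved for that subtask
    private final Map<Integer, List<String>> pendingSplitAssignment = new HashMap<>();

    /** Assumed strategy: distribute splits round-robin so every subtask gets a near-equal share. */
    public BatchSplitAssigner(List<String> splits, int parallelism) {
        for (int i = 0; i < splits.size(); i++) {
            pendingSplitAssignment
                    .computeIfAbsent(i % parallelism, k -> new ArrayList<>())
                    .add(splits.get(i));
        }
    }

    /** Hands the subtask its whole batch in one call; returns an empty list if none is left. */
    public List<String> handleSplitRequest(int subtask) {
        List<String> batch = pendingSplitAssignment.remove(subtask);
        return batch == null ? new ArrayList<>() : batch;
    }
}
```

   Because each subtask receives all of its splits in one request, a reader under this scheme would not need to go idle and be recreated between splits, and a subtask that is scheduled late still finds its share waiting.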



