JeremyXin commented on code in PR #8453:
URL: https://github.com/apache/seatunnel/pull/8453#discussion_r1903592674
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/split/FileSourceSplitEnumerator.java:
##########
@@ -107,8 +110,7 @@ private void assignSplit(int taskId) {
context.assignSplit(taskId, currentTaskSplits);
// save the state of assigned splits
assignedSplit.addAll(currentTaskSplits);
- // remove the assigned splits from pending splits
- currentTaskSplits.forEach(split -> pendingSplit.remove(split));
Review Comment:
Because the principle of the poll-based file allocation strategy I consider
is based on the location of the split in the collection of pendingSplit files,
using assignCount and parallelism modulus to determine which task should be
assigned to the split. The premise of this method is that the pendingSplit set
remains unchanged during allocation. If you remove allocated splits from the
pendingSplit collection, it can cause a change in the location of other splits,
potentially changing their owners and causing problems.
I know that the original purpose of adding this line of code is to prevent
the double allocation of split, but in my opinion, the submitted code
calculates the module based on assignCount and parallelism, and only when the
result is consistent with the taskId value, the split is assigned to the
corresponding task, which will not cause the double allocation of files. It was
checked in unit tests.
So that's why I deleted this line of code. If I need to keep it, I can try
to redesign the code flow. If you have any questions, please contact me in
time. Thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]