Re: [PR] [core] Refactor DataEvolutionSplitGenerator.split to do sort on demand [paimon]

via GitHub Wed, 29 Oct 2025 07:29:30 -0700


JingsongLi commented on code in PR #6496:
URL: https://github.com/apache/paimon/pull/6496#discussion_r2473113325



##########
paimon-core/src/main/java/org/apache/paimon/table/source/DataEvolutionSplitGenerator.java:
##########
@@ -99,90 +101,57 @@ public static <T> List<List<T>> split(
             List<T> files,
             Function<T, String> fileNameF,
             Function<T, Long> firstRowIdF,
-            Function<T, Long> rowCountF,
-            Function<T, Long> maxSequenceNumberF) {
-        List<List<T>> splitByRowId = new ArrayList<>();
-        // Sort files by firstRowId and then by maxSequenceNumber
-        files.sort(
-                Comparator.comparingLong(
-                                (ToLongFunction<T>)
-                                        value ->
-                                                firstRowIdF.apply(value) == 
null
-                                                        ? Long.MIN_VALUE
-                                                        : 
firstRowIdF.apply(value))
-                        .thenComparingInt(f -> isBlobFile(fileNameF.apply(f)) 
? 1 : 0)
-                        .thenComparing(
-                                (f1, f2) -> {
-                                    // If firstRowId is the same, we should 
read the file with
-                                    // larger sequence number first. Because 
larger sequence number
-                                    // file is more fresh
-                                    return Long.compare(
-                                            maxSequenceNumberF.apply(f2),
-                                            maxSequenceNumberF.apply(f1));
-                                }));
-
-        files = filterBlob(files, fileNameF, firstRowIdF, rowCountF);
-
-        // Split files by firstRowId
-        long lastRowId = -1;
-        long checkRowIdStart = 0;
-        List<T> currentSplit = new ArrayList<>();
-        for (int i = 0; i < files.size(); i++) {
-            T file = files.get(i);
-            Long firstRowId = firstRowIdF.apply(file);
-            if (firstRowId == null) {
-                splitByRowId.add(Collections.singletonList(file));
-                continue;
-            }
-            if (!isBlobFile(fileNameF.apply(file)) && firstRowId != lastRowId) 
{
-                if (!currentSplit.isEmpty()) {
-                    splitByRowId.add(currentSplit);
-                }
-                if (firstRowId < checkRowIdStart) {
-                    throw new IllegalStateException(
-                            String.format(
-                                    "There are overlapping files in the split: 
\n %s, the wrong file is: \n %s",
-                                    files.subList(Math.max(0, i - 20), 
i).stream()
-                                            .map(Object::toString)
-                                            .collect(Collectors.joining(",")),
-                                    file));
+            ToLongFunction<T> rowCountF,
+            ToLongFunction<T> maxSeqF) {
+        // group by row id range
+        ToLongFunction<T> firstRowIdFunc =
+                t -> {
+                    Long firstRowId = firstRowIdF.apply(t);
+                    checkArgument(
+                            firstRowId != null, "File %s First row id should 
not be null.", t);
+                    return firstRowId;
+                };
+        ToLongFunction<T> endRowIdF =
+                t -> firstRowIdFunc.applyAsLong(t) + rowCountF.applyAsLong(t) 
- 1;
+        RangeHelper<T> rangeHelper = new RangeHelper<>(firstRowIdFunc, 
endRowIdF);
+        List<List<T>> result = rangeHelper.mergeOverlappingRanges(files);
+
+        // skip group which only has blob files
+        // TODO Why skip? When will this situation occur?
+        result.removeIf(next -> next.stream().allMatch(t -> 
isBlobFile(fileNameF.apply(t))));
+
+        // in group, sort by blob file and max_seq

Review Comment:
   Why we need to sort in group? Why not do this in `DataEvolutionSplitRead`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [core] Refactor DataEvolutionSplitGenerator.split to do sort on demand [paimon]

Reply via email to