[GitHub] [iceberg] jackye1995 commented on a change in pull request #3292: Spark: Compact Medium Size Files (#460)

GitBox Fri, 03 Dec 2021 12:24:36 -0800


jackye1995 commented on a change in pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#discussion_r762221357




##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -61,6 +62,39 @@ public static boolean hasDeletes(FileScanTask task) {
     return CloseableIterable.combine(splitTasks, tasks);
   }
 
+  /**
+   * Split files into FileScanTasks which only contain a single offset (rowGroup). For files which do not
+   * expose the offsets, use the normal split code.
+   * @param tasks Scan tasks, one per whole file to be split
+   * @param fallbackSplitSize the splitSize to use when the file does not contain explicit offsets to use
+   * @return Scan tasks, one per offset
+   */
+  public static CloseableIterable<FileScanTask> splitOnOffsets(CloseableIterable<FileScanTask> tasks,
+                                                               long fallbackSplitSize) {
+    Preconditions.checkArgument(fallbackSplitSize > 0,
+        "Invalid fallback split size (negative or 0): %s", fallbackSplitSize);
+
+    Iterable<FileScanTask> splitTasks = FluentIterable
+        .from(tasks)
+        .transformAndConcat(input -> {
+          DataFile file = input.file();
+          if (file.format().hasOffsets()) {
+            if (file.splitOffsets() != null) {
+              // Split on offsets, size 0 means 1 task per offset
+              return input.split(0);
+            } else {
+              // File too small to have offsets, take the entire file as a task

Review comment:
       To get a general view, I combined this with the original split code. Basically we get:
   
   ```java
   if (file.format().hasOffsets()) {
       if (file.splitOffsets() != null) {
           // split to 1 task per offset
       } else {
           // file too small, use 1 task for entire file
       }
   } else {
       if (file.format().isSplittable()) {
           if (file.splitOffsets() != null) {
               // offset aware split
           } else {
               // fixed size split
           }
       } else {
           // use 1 task for entire file
       }
   }
   ```
   
   So the `hasOffsets && splitOffsets == null` case is basically a short-circuit: the entire file becomes one task even if the format is splittable. There is still some value in that.
   
   My question is whether it would be worth adding this logic directly to the original `splitFiles` instead of having two different methods. I imagine this would also help reduce the number of file scan tasks generated during scan planning.
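
To make the merged version concrete, here is a rough, self-contained sketch of the combined decision as a single function. Everything in it is a hypothetical stand-in for discussion: `SplitPlanSketch`, `Strategy`, and `choose` are not Iceberg classes or methods, and the two boolean parameters only model what `file.format().hasOffsets()` and `file.format().isSplittable()` would return.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the combined split decision; not Iceberg code.
class SplitPlanSketch {

  // The four outcomes named in the comments of the combined if/else tree.
  enum Strategy { ONE_TASK_PER_OFFSET, WHOLE_FILE, OFFSET_AWARE_SPLIT, FIXED_SIZE_SPLIT }

  // hasOffsets and splittable stand in for the format-level flags;
  // splitOffsets stands in for file.splitOffsets() and may be null.
  static Strategy choose(boolean hasOffsets, boolean splittable, List<Long> splitOffsets) {
    if (hasOffsets) {
      // Short-circuit branch: no recorded offsets means the file is small,
      // so it stays one task even when the format is splittable.
      return splitOffsets != null ? Strategy.ONE_TASK_PER_OFFSET : Strategy.WHOLE_FILE;
    }
    if (!splittable) {
      return Strategy.WHOLE_FILE;
    }
    return splitOffsets != null ? Strategy.OFFSET_AWARE_SPLIT : Strategy.FIXED_SIZE_SPLIT;
  }

  public static void main(String[] args) {
    List<Long> offsets = Arrays.asList(4L, 1024L);
    System.out.println(choose(true, true, offsets));   // ONE_TASK_PER_OFFSET
    System.out.println(choose(true, true, null));      // WHOLE_FILE
    System.out.println(choose(false, true, null));     // FIXED_SIZE_SPLIT
    System.out.println(choose(false, false, offsets)); // WHOLE_FILE
  }
}
```

Written this way, the only branch that differs from the existing split path is the first one, which is the part that would need to move into `splitFiles` if the two methods were merged.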




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


