[GitHub] [iceberg] rdblue commented on a change in pull request #3292: Spark: Compact Medium Size Files (#460)

GitBox Tue, 07 Dec 2021 08:45:04 -0800


rdblue commented on a change in pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#discussion_r764179536




##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -61,6 +62,39 @@ public static boolean hasDeletes(FileScanTask task) {
     return CloseableIterable.combine(splitTasks, tasks);
   }
 
+  /**
+   * Split files into FileScanTasks which only contain a single offset (rowGroup). For files which do not
+   * expose the offsets, use the normal split code.
+   * @param tasks Scan tasks, one per whole file to be split
+   * @param fallbackSplitSize the splitSize to use when the file does not contain explicit offsets to use
+   * @return Scan tasks, one per offset
+   */
+  public static CloseableIterable<FileScanTask> splitOnOffsets(CloseableIterable<FileScanTask> tasks,
+                                                               long fallbackSplitSize) {
+    Preconditions.checkArgument(fallbackSplitSize > 0,
+        "Invalid fallback split size (negative or 0): %s", fallbackSplitSize);
+
+    Iterable<FileScanTask> splitTasks = FluentIterable
+        .from(tasks)
+        .transformAndConcat(input -> {
+          DataFile file = input.file();
+          if (file.format().hasOffsets()) {
+            if (file.splitOffsets() != null) {
+              // Split on offsets, size 0 means 1 task per offset
+              return input.split(0);
+            } else {
+              // File too small to have offsets, take the entire file as a task

Review comment:
       I think that logic is incorrect. If a file doesn't have offsets, then we 
can't assume that it is small. It could just be that a writer didn't produce 
the offsets. The Parquet import code may or may not produce offsets, for 
example. We can't assume something about the file just because metadata is 
missing.
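
To illustrate the point, here is a minimal standalone sketch of the behavior I would expect: when offsets are missing, fall back to size-based splitting rather than treating the file as a single task. This is not Iceberg code. The class `OffsetSplitSketch`, the method `planSplits`, and the `{start, length}` range representation are all hypothetical and only model the decision, under the assumption that each offset marks the start of a split that runs to the next offset or the end of the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the split decision; none of these names exist in Iceberg.
public class OffsetSplitSketch {

  // Returns {start, length} byte ranges for one file.
  static List<long[]> planSplits(long fileLength, List<Long> splitOffsets, long fallbackSplitSize) {
    if (fallbackSplitSize <= 0) {
      throw new IllegalArgumentException(
          "Invalid fallback split size (negative or 0): " + fallbackSplitSize);
    }

    List<long[]> splits = new ArrayList<>();
    if (splitOffsets != null && !splitOffsets.isEmpty()) {
      // Offsets are known: one split per offset, ending at the next offset or end of file.
      for (int i = 0; i < splitOffsets.size(); i++) {
        long start = splitOffsets.get(i);
        long end = i + 1 < splitOffsets.size() ? splitOffsets.get(i + 1) : fileLength;
        splits.add(new long[] {start, end - start});
      }
    } else {
      // Offsets are missing: this says nothing about the file size, so split by size.
      for (long start = 0; start < fileLength; start += fallbackSplitSize) {
        splits.add(new long[] {start, Math.min(fallbackSplitSize, fileLength - start)});
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    // A file with recorded offsets is split on those offsets.
    System.out.println(planSplits(100L, Arrays.asList(4L, 40L, 80L), 30L).size());
    // A file without recorded offsets is still split, using the fallback size.
    System.out.println(planSplits(100L, null, 30L).size());
  }
}
```

With this shape, a large file whose writer did not record offsets is still broken into multiple tasks, and a file that really is smaller than the fallback size naturally comes out as one task.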



