rdblue commented on a change in pull request #3292: URL: https://github.com/apache/iceberg/pull/3292#discussion_r764179536
##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -61,6 +62,39 @@ public static boolean hasDeletes(FileScanTask task) {
     return CloseableIterable.combine(splitTasks, tasks);
   }
 
+  /**
+   * Split files into FileScanTasks which only contain a single offset (rowGroup). For files which do not
+   * expose the offsets, use the normal split code.
+   * @param tasks Scan tasks, one per whole file to be split
+   * @param fallbackSplitSize the splitSize to use when the file does not contain explicit offsets to use
+   * @return Scan tasks, one per offset
+   */
+  public static CloseableIterable<FileScanTask> splitOnOffsets(CloseableIterable<FileScanTask> tasks,
+                                                               long fallbackSplitSize) {
+    Preconditions.checkArgument(fallbackSplitSize > 0,
+        "Invalid fallback split size (negative or 0): %s", fallbackSplitSize);
+
+    Iterable<FileScanTask> splitTasks = FluentIterable
+        .from(tasks)
+        .transformAndConcat(input -> {
+          DataFile file = input.file();
+          if (file.format().hasOffsets()) {
+            if (file.splitOffsets() != null) {
+              // Split on offsets, size 0 means 1 task per offset
+              return input.split(0);
+            } else {
+              // File too small to have offsets, take the entire file as a task

Review comment:
   I think that logic is incorrect. If a file doesn't have offsets, then we can't assume that it is small. It could just be that a writer didn't produce the offsets. The Parquet import code may or may not produce offsets, for example. We can't assume something about the file just because metadata is missing.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
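
To make the point of the review comment concrete, here is a minimal sketch of the behavior it argues for: when split offsets are present, produce one split per offset; when they are missing, draw no conclusion about the file's size and fall back to size-based splitting. This is not the Iceberg implementation and not the change that was eventually merged. The class SplitSketch, the method planSplits, and the [start, length] pair representation are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names come from Iceberg.
final class SplitSketch {

  /**
   * Plans splits for one file as {start, length} pairs.
   *
   * @param fileLength        total file length in bytes
   * @param splitOffsets      recorded split offsets (e.g. row group starts), or null if the
   *                          writer did not record any
   * @param fallbackSplitSize split size to use when no offsets are available
   */
  static List<long[]> planSplits(long fileLength, List<Long> splitOffsets, long fallbackSplitSize) {
    if (fallbackSplitSize <= 0) {
      throw new IllegalArgumentException(
          "Invalid fallback split size (negative or 0): " + fallbackSplitSize);
    }

    List<long[]> splits = new ArrayList<>();

    if (splitOffsets != null && !splitOffsets.isEmpty()) {
      // Offsets are known: one split per offset, each ending where the next begins.
      for (int i = 0; i < splitOffsets.size(); i++) {
        long start = splitOffsets.get(i);
        long end = i + 1 < splitOffsets.size() ? splitOffsets.get(i + 1) : fileLength;
        splits.add(new long[] {start, end - start});
      }
      return splits;
    }

    // Offsets are missing. That only means the writer did not record them; the file
    // may still be large, so split by size instead of taking the whole file as one task.
    for (long start = 0; start < fileLength; start += fallbackSplitSize) {
      splits.add(new long[] {start, Math.min(fallbackSplitSize, fileLength - start)});
    }
    return splits;
  }
}
```

Under these assumptions, a call such as SplitSketch.planSplits(length, null, fallbackSplitSize) on a large file without recorded offsets yields several size-based splits rather than a single whole-file task, which is the case the comment is concerned about.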