RussellSpitzer commented on a change in pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#discussion_r751534743
##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -61,6 +62,39 @@ public static boolean hasDeletes(FileScanTask task) {
return CloseableIterable.combine(splitTasks, tasks);
}
+ /**
+ * Split files into FileScanTasks which only contain a single offset (rowGroup). For files which do not
+ * expose the offsets, use the normal split code.
+ * @param tasks Scan tasks, one per whole file to be split
+ * @param fallbackSplitSize the splitSize to use when the file does not contain explicit offsets to use
+ * @return Scan tasks, one per offset
+ */
+ public static CloseableIterable<FileScanTask> splitOnOffsets(CloseableIterable<FileScanTask> tasks,
+ long fallbackSplitSize) {
+ Preconditions.checkArgument(fallbackSplitSize > 0,
+ "Invalid fallback split size (negative or 0): %s", fallbackSplitSize);
+
+ Iterable<FileScanTask> splitTasks = FluentIterable
+ .from(tasks)
+ .transformAndConcat(input -> {
+ DataFile file = input.file();
+ if (file.format().hasOffsets()) {
+ if (file.splitOffsets() != null) {
+ // Split on offsets, size 0 means 1 task per offset
+ return input.split(0);
+ } else {
+ // File too small to have offsets, take the entire file as a task
Review comment:
This is again where I wanted to differentiate between formats that are globally
splittable and formats that are only splittable on offsets. Originally I just used
the base method, but then I had to pass the file size through as the split size to
get a single split, which seemed odd. The size has to be passed through because
that call goes through the non-offset split path, which can otherwise give us
multiple splits (some of which would be empty).
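
To make the distinction concrete, here is a minimal standalone sketch of the two split paths. It is not the Iceberg implementation: `SplitSketch`, `Range`, `splitBySize`, `splitOnOffsets`, and `planOffsetOnlyFile` are hypothetical names, and a task is modeled as a plain byte range. It only illustrates why passing the file length as the split size collapses the size-based path to a single split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types; none of these names come from Iceberg.
class SplitSketch {

  /** A byte range [start, start + length) of one file. */
  static final class Range {
    final long start;
    final long length;

    Range(long start, long length) {
      this.start = start;
      this.length = length;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + (start + length) + ")";
    }
  }

  /** Size-based path: consecutive ranges of at most splitSize bytes. */
  static List<Range> splitBySize(long fileLength, long splitSize) {
    if (splitSize <= 0) {
      throw new IllegalArgumentException("Invalid split size (negative or 0): " + splitSize);
    }
    List<Range> ranges = new ArrayList<>();
    for (long start = 0; start < fileLength; start += splitSize) {
      ranges.add(new Range(start, Math.min(splitSize, fileLength - start)));
    }
    return ranges;
  }

  /** Offset-based path: one range per offset, ending at the next offset or at end of file. */
  static List<Range> splitOnOffsets(long fileLength, List<Long> offsets) {
    List<Range> ranges = new ArrayList<>();
    for (int i = 0; i < offsets.size(); i++) {
      long start = offsets.get(i);
      long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
      ranges.add(new Range(start, end - start));
    }
    return ranges;
  }

  /**
   * Planning for a format that is only splittable on offsets.
   * Assumes fileLength > 0. With no recorded offsets, the file length is used
   * as the split size so the size-based path yields exactly one range.
   */
  static List<Range> planOffsetOnlyFile(long fileLength, List<Long> offsets) {
    if (offsets != null) {
      return splitOnOffsets(fileLength, offsets);
    }
    return splitBySize(fileLength, fileLength);
  }

  public static void main(String[] args) {
    // A smaller split size on the size-based path produces several ranges.
    System.out.println(splitBySize(100, 40));
    // With offsets, there is one range per offset.
    System.out.println(planOffsetOnlyFile(100, Arrays.asList(4L, 50L)));
    // Without offsets, the whole file becomes a single range.
    System.out.println(planOffsetOnlyFile(100, null));
  }
}
```

In this sketch any split size smaller than the file length yields more than one range on the size-based path, which is the situation the comment wants to avoid for files that can only be read at offsets.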
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]