RussellSpitzer commented on code in PR #4759:
URL: https://github.com/apache/iceberg/pull/4759#discussion_r902804115
##########
core/src/main/java/org/apache/iceberg/actions/SortStrategy.java:
##########
@@ -80,12 +118,55 @@ public RewriteStrategy options(Map<String, String> options) {
return this;
}
+ @Override
+ public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+ if (rewriteAll()) {
+ LOG.info("Table {} set to rewrite all data files", table().name());
+ return dataFiles;
+ } else {
+ // Remove files that are completely sorted.
+ // Example: File_A(1, 10), File_B(11, 25), File_C(15, 30), File_D(31, 40)
+ // Then only File_B and File_C are selected
+ Iterable<FileScanTask> selectedFiles = SortStrategyUtil.removeSortedFiles(dataFiles, sortOrder);
+ if (!haveGoodFileSizes(selectedFiles) || !areFilesSorted(selectedFiles)) {
+ // Rewrite all selected files if they are mis-sized or have a bad sortedness score
+ return selectedFiles;
+ }
+ return ImmutableList.of();
+ }
+ }
+
+ @Override
+ public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+ ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize(), 1, false);
Review Comment:
We may have a slight issue here: this bin-packing algorithm may group
together files that are not actually adjacent. For example, say I mark files
A, B, X, Y, Z, where A and B overlap and X, Y, Z overlap, and the packer then
makes the groups
A, Y, Z and B, X
Sorting and rewriting B, X will have no effect, because B and X do not overlap
each other. Sorting and rewriting A, Y, Z will just create new files that
potentially overlap with B and X.
So in this case we probably need to sort our files or cluster them together
based on common overlaps. I think we may be able to hold off on clustering
until a future improvement, but for now we should probably at least attempt to
sort this list by start offset?
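
To make the suggestion concrete, here is a minimal sketch of sorting by start offset and then grouping files whose ranges overlap. It is only an illustration, not code from this PR: `FileRange`, `OverlapGrouping`, and `groupByOverlap` are hypothetical names, the bounds are plain `long` values standing in for the min/max of the sort column, and bounds are assumed to be inclusive (so (1, 10) and (11, 25) do not overlap, as in the code comment above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a data file plus the min/max of its sort column.
final class FileRange {
  final String name;
  final long lower;
  final long upper;

  FileRange(String name, long lower, long upper) {
    this.name = name;
    this.lower = lower;
    this.upper = upper;
  }
}

final class OverlapGrouping {
  private OverlapGrouping() {}

  // Sort by lower bound, then sweep once. A file joins the current group when
  // it starts at or before the largest upper bound seen so far in that group;
  // otherwise it starts a new group.
  static List<List<FileRange>> groupByOverlap(List<FileRange> files) {
    List<FileRange> sorted = new ArrayList<>(files);
    sorted.sort(Comparator.comparingLong((FileRange f) -> f.lower));

    List<List<FileRange>> groups = new ArrayList<>();
    List<FileRange> current = new ArrayList<>();
    long maxUpper = Long.MIN_VALUE;

    for (FileRange file : sorted) {
      if (!current.isEmpty() && file.lower > maxUpper) {
        // Gap between this file and everything in the current group.
        groups.add(current);
        current = new ArrayList<>();
      }
      current.add(file);
      // Reset the running maximum when a new group has just been started.
      maxUpper = current.size() == 1 ? file.upper : Math.max(maxUpper, file.upper);
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

With the example above, input in the order A, Y, Z, B, X would come back as the groups A, B and X, Y, Z. The sketch ignores the maximum group size, so a size-bounded packing step would presumably still be needed within each overlap group.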
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]