RussellSpitzer commented on code in PR #4759:
URL: https://github.com/apache/iceberg/pull/4759#discussion_r902804115


##########
core/src/main/java/org/apache/iceberg/actions/SortStrategy.java:
##########
@@ -80,12 +118,55 @@ public RewriteStrategy options(Map<String, String> options) {
     return this;
   }
 
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    if (rewriteAll()) {
+      LOG.info("Table {} set to rewrite all data files", table().name());
+      return dataFiles;
+    } else {
+      // Remove files that are completely sorted.
+      // Example: File_A(1, 10), File_B(11, 25), File_C(15, 30), File_D(31, 40)
+      // Then only File_B and File_C are selected
+      Iterable<FileScanTask> selectedFiles = SortStrategyUtil.removeSortedFiles(dataFiles, sortOrder);
+      if (!haveGoodFileSizes(selectedFiles) || !areFilesSorted(selectedFiles)) {
+        // Rewrite all selected files if they are mis-sized or have bad sortedness score
+        return selectedFiles;
+      }
+      return ImmutableList.of();
+    }
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize(), 1, false);

Review Comment:
   We may have a slight issue here: this bin-packing algorithm may group 
together files which are not actually adjacent. For example, say I mark files
   
   A, B, X, Y, Z
   
   where A and B overlap and X, Y, Z overlap,
   
   and then make groups
   
   A, Y, Z and B, X
   
   Sorting and rewriting B, X will have no effect, since those two files do not 
overlap each other, and sorting and rewriting A, Y, Z will just create new files 
which potentially overlap with B and X.
   
   So in this case we probably need to sort our files or cluster them together 
based on common overlaps. I think we may be able to hold off on clustering until 
a future improvement, but for now we should probably at least attempt to sort 
this list based on start offsets?
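
A rough sketch of what sorting before packing could look like. This is not the Iceberg API: `FileRange`, `planGroups`, and the byte sizes are hypothetical stand-ins for `FileScanTask` and the real packer, and "start offset" is assumed to mean the lower bound of each file's sort-key range. Files are ordered by lower bound and then packed in that order, so a group only ever contains neighbours in sort order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortedPackingSketch {
  /** Hypothetical stand-in for a FileScanTask: name, sort-key range, size in bytes. */
  static final class FileRange {
    final String name;
    final long lower;
    final long upper;
    final long sizeBytes;

    FileRange(String name, long lower, long upper, long sizeBytes) {
      this.name = name;
      this.lower = lower;
      this.upper = upper;
      this.sizeBytes = sizeBytes;
    }

    @Override
    public String toString() {
      return name;
    }
  }

  /** Sorts by lower bound, then fills groups in that order up to maxGroupSize bytes. */
  static List<List<FileRange>> planGroups(List<FileRange> files, long maxGroupSize) {
    List<FileRange> sorted = new ArrayList<>(files);
    sorted.sort(Comparator.comparingLong((FileRange f) -> f.lower));

    List<List<FileRange>> groups = new ArrayList<>();
    List<FileRange> current = new ArrayList<>();
    long currentSize = 0;
    for (FileRange f : sorted) {
      // Start a new group when adding this file would exceed the size limit.
      if (!current.isEmpty() && currentSize + f.sizeBytes > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(f);
      currentSize += f.sizeBytes;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // A and B overlap; X, Y, Z overlap. Input order is deliberately shuffled.
    List<FileRange> files = Arrays.asList(
        new FileRange("A", 1, 10, 40),
        new FileRange("Y", 105, 120, 30),
        new FileRange("Z", 115, 130, 30),
        new FileRange("B", 5, 15, 40),
        new FileRange("X", 100, 110, 30));
    System.out.println(planGroups(files, 100)); // prints [[A, B], [X, Y, Z]]
  }
}
```

With these made-up sizes the groups happen to line up with the overlap clusters. In general, sorting only keeps groups contiguous in sort order; a size limit can still split a cluster across two groups, which is where the clustering follow-up would help.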



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

