RussellSpitzer commented on code in PR #4759:
URL: https://github.com/apache/iceberg/pull/4759#discussion_r902805939


##########
core/src/main/java/org/apache/iceberg/actions/SortStrategy.java:
##########
@@ -80,12 +118,55 @@ public RewriteStrategy options(Map<String, String> 
options) {
     return this;
   }
 
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> 
dataFiles) {
+    if (rewriteAll()) {
+      LOG.info("Table {} set to rewrite all data files", table().name());
+      return dataFiles;
+    } else {
+      // Remove files that are completely sorted.
+      // Example: File_A(1, 10), File_B(11, 25), File_C(15, 30), File_D(31, 40)
+      // Then only File_B and File_C are selected
+      Iterable<FileScanTask> selectedFiles = 
SortStrategyUtil.removeSortedFiles(dataFiles, sortOrder);
+      if (!haveGoodFileSizes(selectedFiles) || !areFilesSorted(selectedFiles)) 
{
+        // Rewrite all selected files if they are mis-sized or have bad 
sortedness score
+        return selectedFiles;
+      }
+      return ImmutableList.of();
+    }
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> 
dataFiles) {
+    ListPacker<FileScanTask> packer = new 
BinPacking.ListPacker<>(maxGroupSize(), 1, false);
+    return packer.pack(dataFiles, FileScanTask::length);
+  }
+
   protected void validateOptions() {
+    Preconditions.checkArgument(misSizedRatioThreshold >= 0 && 
misSizedRatioThreshold <= 1,
+        "mis-sized-ratio-threshold value must be between 0.0 and 1.0");
+
+    Preconditions.checkArgument(sortednessScoreThreshold >= 0 && 
sortednessScoreThreshold <= 1,
+        "sortedness-score-threshold value must be between 0.0 and 1.0");
+
     Preconditions.checkArgument(!sortOrder.isUnsorted(),
         "Can't use %s when there is no sort order, either define table %s's 
sort order or set sort" +
             "order in the action",
         name(), table().name());
 
     SortOrder.checkCompatibility(sortOrder, table().schema());
   }
+
+  boolean haveGoodFileSizes(Iterable<FileScanTask> dataFiles) {

Review Comment:
   I don't think we should check file size if we are also checking "sortedness" 
but if we are I think we should just use the binpack rules.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to