anuragmantri commented on a change in pull request #4433:
URL: https://github.com/apache/iceberg/pull/4433#discussion_r838689504
##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -48,14 +48,6 @@
public abstract class SortStrategy extends BinPackStrategy {
private static final Logger LOG =
LoggerFactory.getLogger(SortStrategy.class);
- /**
- * Rewrites all files, regardless of their size. Defaults to false,
rewriting only mis-sized
- * files;
- */
- public static final String REWRITE_ALL = "rewrite-all";
- public static final boolean REWRITE_ALL_DEFAULT = false;
-
-
private static final Set<String> validOptions = ImmutableSet.of(
REWRITE_ALL
);
Review comment:
This is redundant and can be removed. Same with the line
`.addAll(validOptions)` in `validOptions()` in this file.
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -140,28 +149,39 @@ public RewriteStrategy options(Map<String, String>
options) {
DELETE_FILE_THRESHOLD,
DELETE_FILE_THRESHOLD_DEFAULT);
+ rewriteAll = PropertyUtil.propertyAsBoolean(options,
+ REWRITE_ALL,
+ REWRITE_ALL_DEFAULT);
+
validateOptions();
return this;
}
@Override
public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask>
dataFiles) {
- return FluentIterable.from(dataFiles)
- .filter(scanTask -> scanTask.length() < minFileSize ||
scanTask.length() > maxFileSize ||
- taskHasTooManyDeletes(scanTask));
+ if (rewriteAll) {
+ return dataFiles;
Review comment:
Maybe add a info log here similar to SortStrategy
`LOG.info("Sort Strategy for table {} set to rewrite all data files",
table().name());`?
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -140,28 +149,39 @@ public RewriteStrategy options(Map<String, String>
options) {
DELETE_FILE_THRESHOLD,
DELETE_FILE_THRESHOLD_DEFAULT);
+ rewriteAll = PropertyUtil.propertyAsBoolean(options,
+ REWRITE_ALL,
+ REWRITE_ALL_DEFAULT);
+
validateOptions();
return this;
}
@Override
public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask>
dataFiles) {
- return FluentIterable.from(dataFiles)
- .filter(scanTask -> scanTask.length() < minFileSize ||
scanTask.length() > maxFileSize ||
- taskHasTooManyDeletes(scanTask));
+ if (rewriteAll) {
+ return dataFiles;
+ } else {
+ return FluentIterable.from(dataFiles)
+ .filter(scanTask -> scanTask.length() < minFileSize ||
scanTask.length() > maxFileSize ||
+ taskHasTooManyDeletes(scanTask));
+ }
}
@Override
public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask>
dataFiles) {
ListPacker<FileScanTask> packer = new
BinPacking.ListPacker<>(maxGroupSize, 1, false);
List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles,
FileScanTask::length);
-
- return potentialGroups.stream().filter(group ->
- (group.size() >= minInputFiles && group.size() > 1) ||
- (sizeOfInputFiles(group) > targetFileSize && group.size() > 1)
||
- sizeOfInputFiles(group) > maxFileSize ||
- group.stream().anyMatch(this::taskHasTooManyDeletes)
- ).collect(Collectors.toList());
+ if (rewriteAll) {
+ return potentialGroups;
Review comment:
Since we are moving the rewrite logic to `BinPackStrategy`, we can
remove redundant code in `SortStrategy`. For example, planFileGroups in
`SortStrategy` can be
```
@Override
public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask>
dataFiles) {
return super.planFileGroups(dataFiles);
}
```
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -140,28 +149,39 @@ public RewriteStrategy options(Map<String, String>
options) {
DELETE_FILE_THRESHOLD,
DELETE_FILE_THRESHOLD_DEFAULT);
+ rewriteAll = PropertyUtil.propertyAsBoolean(options,
+ REWRITE_ALL,
+ REWRITE_ALL_DEFAULT);
+
Review comment:
Since we are adding this option in BinPackStrategy, it's redundant to do
so in `SortStrategy`
[here](https://github.com/apache/iceberg/blob/61fe9b1ce3ecff4354bb7624c9b5e3f76df2221b/core/src/main/java/org/apache/iceberg/actions/SortStrategy.java#L97).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]