findepi commented on code in PR #14921:
URL: https://github.com/apache/iceberg/pull/14921#discussion_r2667486407
##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -259,32 +261,44 @@ public void cleanFiles(
});
if (ExpireSnapshots.CleanupLevel.ALL == cleanupLevel) {
- Set<String> filesToDelete =
+ Set<FileInfo> filesToDelete =
findFilesToDelete(
manifestsToScan, manifestsToRevert, validIds,
beforeExpiration.specsById());
- LOG.debug("Deleting {} data files", filesToDelete.size());
- deleteFiles(filesToDelete, "data");
+ Map<FileContent, Set<String>> groupedFilesToDelete =
+ filesToDelete.stream()
+ .collect(
+ Collectors.groupingBy(
+ FileInfo::getContent,
+ Collectors.mapping(FileInfo::getPath,
Collectors.toSet())));
+
+ for (Map.Entry<FileContent, Set<String>> entry :
groupedFilesToDelete.entrySet()) {
Review Comment:
Why group now? It seems orthogonal to collecting summary
##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -141,6 +236,46 @@ protected Set<String> expiredStatisticsFilesLocations(
return Sets.difference(statsFileLocationsBeforeExpiration,
statsFileLocationsAfterExpiration);
}
+ protected static class FileInfo {
Review Comment:
It's a 3rd `FileInfo` class. Can this one have a more distinct name?
##########
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java:
##########
@@ -208,8 +226,12 @@ private Set<String> findFilesToDelete(
}
// Remove all the live files from the candidate deletion set
- try (CloseableIterable<String> paths =
ManifestFiles.readPaths(manifest, fileIO)) {
- paths.forEach(filesToDelete::remove);
+ try (CloseableIterable<DataFile> entries =
+ ManifestFiles.readColumns(
+ manifest, fileIO, ImmutableList.of("content",
"file_path"))) {
Review Comment:
Should `ImmutableList.of("content", "file_path")` be a static constant in
this class?
##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -42,6 +46,9 @@ public void accept(String file) {
};
private static final Logger LOG =
LoggerFactory.getLogger(FileCleanupStrategy.class);
+ protected static final String MANIFEST = "manifest";
+ protected static final String MANIFEST_LIST = "manifest list";
+ protected static final String STATISTICS_FILES = "statistics files";
Review Comment:
Can this be an enum?
##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -124,7 +134,92 @@ protected void deleteFiles(Set<String> pathsToDelete,
String fileType) {
.suppressFailureWhenFinished()
.onFailure(
(file, thrown) -> LOG.warn("Delete failed for {} file: {}",
fileType, file, thrown))
- .run(deleteFuncToUse::accept);
+ .run(
+ file -> {
+ deleteFuncToUse.accept(file);
+ summary.deletedFile(fileType);
+ });
+ }
+ }
+
+ static class DeleteSummary {
+ private final AtomicLong dataFilesCount = new AtomicLong(0L);
+ private final AtomicLong positionDeleteFilesCount = new AtomicLong(0L);
+ private final AtomicLong equalityDeleteFilesCount = new AtomicLong(0L);
+ private final AtomicLong manifestsCount = new AtomicLong(0L);
+ private final AtomicLong manifestListsCount = new AtomicLong(0L);
+ private final AtomicLong statisticsFilesCount = new AtomicLong(0L);
+
+ public void deletedFiles(String type, int numFiles) {
+ if (FileContent.DATA.name().equalsIgnoreCase(type)) {
+ dataFilesCount.addAndGet(numFiles);
+
+ } else if (FileContent.POSITION_DELETES.name().equalsIgnoreCase(type)) {
+ positionDeleteFilesCount.addAndGet(numFiles);
+
+ } else if (FileContent.EQUALITY_DELETES.name().equalsIgnoreCase(type)) {
+ equalityDeleteFilesCount.addAndGet(numFiles);
+
+ } else if (MANIFEST.equalsIgnoreCase(type)) {
+ manifestsCount.addAndGet(numFiles);
+
+ } else if (MANIFEST_LIST.equalsIgnoreCase(type)) {
+ manifestListsCount.addAndGet(numFiles);
+
+ } else if (STATISTICS_FILES.equalsIgnoreCase(type)) {
+ statisticsFilesCount.addAndGet(numFiles);
+
+ } else {
+ throw new ValidationException("Illegal file type: %s", type);
+ }
+ }
+
+ public void deletedFile(String type) {
Review Comment:
Can this call `deletedFiles(type, 1)` to avoid duplicate logic?
##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -141,6 +236,46 @@ protected Set<String> expiredStatisticsFilesLocations(
return Sets.difference(statsFileLocationsBeforeExpiration,
statsFileLocationsAfterExpiration);
}
+ protected static class FileInfo {
+ private final FileContent content;
+ private final String path;
+
+ public FileInfo(FileContent content, String path) {
+ this.content = content;
+ this.path = path;
Review Comment:
While this project code style is generally null friendly, we don't need to
adhere to this in new code.
I would use `Preconditions.checkNotNull` here (and then maybe put `@Nonnull`
on the getters)
##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -320,12 +336,14 @@ private Set<String> findFilesToDelete(
.run(
manifest -> {
// the manifest has deletes, scan it to find files to delete
- try (ManifestReader<?> reader = ManifestFiles.open(manifest,
fileIO, specsById)) {
- for (ManifestEntry<?> entry : reader.entries()) {
+ try (ManifestReader<? extends ContentFile<?>> reader =
+ ManifestFiles.open(manifest, fileIO, specsById)) {
+ for (ManifestEntry<? extends ContentFile<?>> entry :
reader.entries()) {
// delete any ADDED file from manifests that were reverted
if (entry.status() == ManifestEntry.Status.ADDED) {
// use toString to ensure the path will not change (Utf8
is reused)
Review Comment:
Pre-existing. This is obsolete. Would be nice to remove. (same for same
comment above)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]