aokolnychyi commented on code in PR #5459:
URL: https://github.com/apache/iceberg/pull/5459#discussion_r944126591
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -151,20 +175,34 @@ private static void deleteFiles(FileIO io,
Set<ManifestFile> allManifests) {
.run(
manifest -> {
try (ManifestReader<?> reader = ManifestFiles.open(manifest,
io)) {
+ List<String> pathsToDelete = Lists.newArrayList();
for (ManifestEntry<?> entry : reader.entries()) {
// intern the file path because the weak key map uses
identity (==) instead of
// equals
String path = entry.file().path().toString().intern();
Boolean alreadyDeleted = deletedFiles.putIfAbsent(path,
true);
if (alreadyDeleted == null || !alreadyDeleted) {
try {
- io.deleteFile(path);
+ if (io instanceof SupportsBulkOperations) {
Review Comment:
What about collecting files into a list in all cases and then checking what
type of `FileIO` we got? Hopefully, it should be a little bit more readable and
will avoid the instanceof call for each manifest entry, which is expensive.
```
if (io instanceof SupportsBulkOperations) {
SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
bulkDelete(bulkIO, pathsToDelete, "files");
} else {
for (String path : pathsToDelete) {
delete(io, path);
}
}
```
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -102,6 +106,26 @@ public static void dropTableData(FileIO io, TableMetadata
metadata) {
deleteFiles(io, manifestsToDelete);
}
+ if (io instanceof SupportsBulkOperations) {
Review Comment:
I feel like there are three types of deletes. What about defining helper
methods for them that would abstract some common logic like try catch? The type
annotation is optional, it can probably be deduced from the file name.
```
private static void delete(FileIO io, String file, String type) {
try {
io.deleteFile(file);
} catch (RuntimeException e) {
LOG.warn("Failed to delete {}: {}", type, file, e);
}
}
private static void delete(FileIO io, Iterable<String> files, String type) {
Tasks.foreach(files)
.executeWith(ThreadPools.getWorkerPool())
.noRetry()
.suppressFailureWhenFinished()
.onFailure((file, exc) -> LOG.warn("Delete failed for {}: {}", type,
file, exc))
.run(io::deleteFile);
}
private static void bulkDelete(SupportsBulkOperations io, Iterable<String>
files, String type) {
try {
io.deleteFiles(files);
} catch (RuntimeException e) {
LOG.warn("Failed to bulk delete {}", type, e);
}
}
```
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -102,6 +106,26 @@ public static void dropTableData(FileIO io, TableMetadata
metadata) {
deleteFiles(io, manifestsToDelete);
}
+ if (io instanceof SupportsBulkOperations) {
+ SupportsBulkOperations bulkFileIO = (SupportsBulkOperations) io;
+ Iterable<String> manifestPaths = Iterables.transform(manifestsToDelete,
ManifestFile::path);
+ Iterable<String> prevMetadataFiles =
+ Iterables.transform(metadata.previousFiles(),
TableMetadata.MetadataLogEntry::file);
+ try {
+ Iterable<String> filesToDelete =
+ Iterables.concat(
+ manifestPaths,
+ manifestListsToDelete,
+ prevMetadataFiles,
+ ImmutableList.of(metadata.metadataFileLocation()));
+ bulkFileIO.deleteFiles(filesToDelete);
Review Comment:
I feel like we should either follow what we do for non-bulk deletes and log
an error message per each type or adapt the code below for non-bulk deletes to
delete all files in one go.
```
if (io instanceof SupportsBulkOperations) {
SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
bulkDelete(bulkIO, manifests, "manifests");
bulkDelete(bulkIO, manifestLists, "manifest lists");
bulkDelete(bulkIO, metadataFiles, "metadata files");
} else {
delete(io, manifests, "manifest");
delete(io, manifestLists, "manifest list");
delete(io, metadataFiles, "metadata file");
}
```
Or
```
if (io instanceof SupportsBulkOperations) {
SupportsBulkOperations bulkIO = (SupportsBulkOperations) io;
bulkDelete(bulkIO, files);
} else {
delete(io, files);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]