steveloughran commented on code in PR #15436:
URL: https://github.com/apache/iceberg/pull/15436#discussion_r2918964163
##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -173,8 +185,50 @@ public void deletePrefix(String prefix) {
}
}
+  /**
+   * Is HadoopFileIO configured to use the Hadoop bulk delete API?
+   *
+   * @return true if the {@link BulkDeleter} should be used.
+   */
+  @VisibleForTesting
+  boolean useBulkDeleteApi() {
+    if (useBulkDelete == null) {
+      useBulkDelete = conf().getBoolean(BULK_DELETE_ENABLED, DEFAULT_BULK_DELETE_ENABLED);
+    }
+    return useBulkDelete;
+  }
+
+  /**
+   * Delete files.
+   *
+   * <p>If the Hadoop bulk deletion API is enabled, this API is used through {@link BulkDeleter}.
+   * Otherwise, each file is deleted individually in the thread pool.
+   *
+   * @param pathsToDelete the paths to delete
+   * @throws BulkDeletionFailureException failure to delete one or more files
+   * @throws IllegalStateException if bulk delete is enabled but the Hadoop runtime does not
+   *     support it
+   */
   @Override
   public void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException {
+    if (useBulkDeleteApi()) {
+      // bulk delete.
+      Preconditions.checkState(
+          BulkDeleter.apiAvailable(),
+          "Bulk delete has been enabled but is not present within the current hadoop library. "
+              + "Review the value of "
+              + BULK_DELETE_ENABLED);
+      // this call would trigger CNFE on spark 3.5; the probe above is to generate a more
+      // meaningful error message with instructions.
+      final int count =
+          new BulkDeleter(executorService(), getConf()).bulkDeleteFiles(pathsToDelete);
+      if (count != 0) {
+        throw new BulkDeletionFailureException(count);
Review Comment:
not going to attempt that.
I had explored that with the first patch, but consider this: what failures can happen in a bulk delete which a classic delete(path, false) can't?
For any FS without a custom implementation, the answer is "none". For S3A we will actually miss detecting and reporting on passing in a directory tree: this'll become a no-op. Iceberg never tries to do that, so it's moot.
That just leaves transients (network etc.), which should be handled, and permissions, *which aren't going to recover*.
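
The inline comment in the diff notes that calling into the bulk delete path directly would trigger a ClassNotFoundException on Spark 3.5 (which bundles an older Hadoop), so `BulkDeleter.apiAvailable()` probes first to produce a clearer error. As a hypothetical sketch only (the class name `org.apache.hadoop.fs.BulkDelete` and the probe shape are assumptions, not necessarily what this PR implements), such an availability check can be written as a plain classpath probe:

```java
// Sketch of a reflection-based availability probe, assuming the API is
// detectable by loading a marker class. The probed class name is an
// assumption for illustration; the PR's BulkDeleter.apiAvailable() may differ.
public final class BulkDeleteProbe {

  // Assumed marker: the Hadoop bulk delete interface added by HADOOP-18679.
  static final String BULK_DELETE_CLASS = "org.apache.hadoop.fs.BulkDelete";

  private BulkDeleteProbe() {}

  /** Returns true iff the named class can be loaded from the classpath. */
  static boolean classAvailable(String className) {
    try {
      // initialize=false: we only care whether the class resolves,
      // not about running its static initializers.
      Class.forName(className, false, BulkDeleteProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Probe once up front, rather than letting a later call throw a raw CNFE. */
  static boolean apiAvailable() {
    return classAvailable(BULK_DELETE_CLASS);
  }

  public static void main(String[] args) {
    // A class that is always on the classpath resolves; an absent one does not.
    System.out.println(classAvailable("java.util.List"));    // true
    System.out.println(classAvailable("no.such.ClassHere")); // false
  }
}
```

Probing up front lets the caller fail with an actionable message (e.g. pointing at the enabling configuration key), which is exactly the role `Preconditions.checkState` plays in the diff above.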
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]