sumeetgajjar opened a new pull request, #4817: URL: https://github.com/apache/iceberg/pull/4817
## What changes are proposed in this PR? This PR adds a new Interface `OrphanFileStatus` that indicates if an orphan file was deleted or not. In cases of failure during file deletion, It provides a reference to the encountered exception. With this additional information, a user can choose to record this failure and retry the deletion process in a controlled fashion in the future. ## Why are the changes are needed? During the execution of the DeleteOrphanFiles spark action or remove_orphan_files SQL procedure, a failure encountered during deletion of the orphan file is not bubbled up to the user nor is there any indication of the failure in the returned result. The return value of the Spark action is Iterable<String> and the SQL procedure simply displays a list of orphan files in a table. With this limited information, the only way for the user to know if the orphan file was deleted or not is to - grep the logs for the warning - for each of the locations returned in the iterable, query the cloud storage for its existence https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java#L225-L230 https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/core/src/main/java/org/apache/iceberg/actions/BaseDeleteOrphanFilesActionResult.java#L31 https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java#L57-L59 One can re-run the expensive delete action to delete the files that failed during the previous delete action run, however, that is not very desirable since re-listing the dir contents to identify the orphan files would result in duplicate work plus wastage of precious API calls. One of the common causes of delete failure in the public cloud is hitting the API quotas and unnecessary re-runs of the delete action can lead to a temporary denial of access to other workloads accessing the same storage resource. ## Output ### Before this change #### DryRun => true/false ```bash scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false) +--------------------------------------------------------------------------------------------------------------------------+ |orphan_file_location | +--------------------------------------------------------------------------------------------------------------------------+ |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet| |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet| +--------------------------------------------------------------------------------------------------------------------------+ ``` ### After this change #### DryRun => true ```bash scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false) +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ |orphan_file_location |deleted|error_message| +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false |null | |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false |null | +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ ``` #### DryRun => false ```bash scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => false)").show(false) +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ |orphan_file_location |deleted|error_message| +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true |null | |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true |null | +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+ ``` #### DryRun => false, simulating failure during delete ```bash +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+ |orphan_file_location |deleted|error_message | +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+ |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-615125af-bfbb-4279-b22b-559dee4f0b13-c000.snappy.parquet|true |null | |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|false |simulating failure during file deletion| +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
