sumeetgajjar opened a new pull request, #4817:
URL: https://github.com/apache/iceberg/pull/4817

   ## What changes are proposed in this PR?
   
   This PR adds a new interface, `OrphanFileStatus`, that indicates whether an orphan file was deleted. If the deletion fails, it provides a reference to the encountered exception.
   
   With this additional information, a user can choose to record this failure 
and retry the deletion process in a controlled fashion in the future.
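
To make the idea concrete, here is a minimal sketch of what such a status type could look like and how a caller might pick out the failed deletions. The method names (`location()`, `deleted()`, `failureCause()`), the `SimpleOrphanFileStatus` class, and the `failedLocations` helper are assumptions made for illustration; the interface actually added by this PR may be shaped differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical shape of the status type; names are illustrative assumptions.
interface OrphanFileStatus {
  String location();                  // orphan file location
  boolean deleted();                  // true only if the delete succeeded
  Optional<Exception> failureCause(); // present only when the delete failed
}

// Simple immutable implementation used only for this sketch.
final class SimpleOrphanFileStatus implements OrphanFileStatus {
  private final String location;
  private final boolean deleted;
  private final Exception cause;

  private SimpleOrphanFileStatus(String location, boolean deleted, Exception cause) {
    this.location = location;
    this.deleted = deleted;
    this.cause = cause;
  }

  static OrphanFileStatus success(String location) {
    return new SimpleOrphanFileStatus(location, true, null);
  }

  static OrphanFileStatus failure(String location, Exception cause) {
    return new SimpleOrphanFileStatus(location, false, cause);
  }

  @Override public String location() { return location; }
  @Override public boolean deleted() { return deleted; }
  @Override public Optional<Exception> failureCause() { return Optional.ofNullable(cause); }
}

final class OrphanFileStatusExample {
  // Collects the locations whose deletion failed with an exception,
  // so that they can be recorded and retried later.
  static List<String> failedLocations(List<OrphanFileStatus> statuses) {
    List<String> failed = new ArrayList<>();
    for (OrphanFileStatus status : statuses) {
      if (!status.deleted() && status.failureCause().isPresent()) {
        failed.add(status.location());
      }
    }
    return failed;
  }
}
```

In this sketch a dry run would produce statuses with `deleted() == false` and no failure cause, which is why the helper checks both fields before treating an entry as a failure.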
   
   ## Why are the changes needed?
   
   During the execution of the `DeleteOrphanFiles` Spark action or the `remove_orphan_files` SQL procedure, a failure encountered while deleting an orphan file is neither bubbled up to the user nor indicated in the returned result.
   
   The return value of the Spark action is `Iterable<String>`, and the SQL procedure simply displays a list of orphan files in a table. With this limited information, the only ways for the user to know whether an orphan file was deleted are to:
   - grep the logs for the warning
   - query the cloud storage for the existence of each location returned in the iterable
   
   
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java#L225-L230
   
   
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/core/src/main/java/org/apache/iceberg/actions/BaseDeleteOrphanFilesActionResult.java#L31
   
   
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java#L57-L59
   
   One can re-run the expensive delete action to delete the files that failed during the previous run. However, that is undesirable: re-listing the directory contents to identify the orphan files duplicates work and wastes precious API calls. A common cause of delete failures in the public cloud is hitting API quotas, and unnecessary re-runs of the delete action can lead to a temporary denial of access for other workloads that use the same storage resource.
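
As a rough illustration of retrying in a controlled fashion, the sketch below retries only the locations that previously failed, without re-listing the table directory. The `OrphanFileRetry` class, the `retryDeletes` helper, and its parameters are hypothetical and not part of this PR; the delete function stands in for whatever storage client the caller uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: retries deletion for known-failed locations only.
final class OrphanFileRetry {
  // Returns the locations that still could not be deleted after maxAttempts.
  static List<String> retryDeletes(
      List<String> failedLocations,
      Consumer<String> deleteFunc,
      int maxAttempts,
      long backoffMillis) {
    List<String> remaining = new ArrayList<>(failedLocations);
    for (int attempt = 1; attempt <= maxAttempts && !remaining.isEmpty(); attempt++) {
      List<String> stillFailing = new ArrayList<>();
      for (String location : remaining) {
        try {
          deleteFunc.accept(location);
        } catch (RuntimeException e) {
          // Keep the location for the next attempt instead of re-listing.
          stillFailing.add(location);
        }
      }
      remaining = stillFailing;
      if (!remaining.isEmpty() && attempt < maxAttempts) {
        try {
          // Linear backoff to avoid hammering a quota-limited API.
          Thread.sleep(backoffMillis * attempt);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return remaining;
        }
      }
    }
    return remaining;
  }
}
```

The number of delete calls is bounded by the number of failed files times `maxAttempts`, which is typically far smaller than the cost of listing the whole table location again.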
   
   ## Output
   
   ### Before this change
   
   #### DryRun => true/false
   ```bash
   scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
   +--------------------------------------------------------------------------------------------------------------------------+
   |orphan_file_location                                                                                                      |
   +--------------------------------------------------------------------------------------------------------------------------+
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
   +--------------------------------------------------------------------------------------------------------------------------+
   ```
   
   ### After this change
   
   #### DryRun => true
   ```bash
   scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   |orphan_file_location                                                                                                      |deleted|error_message|
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false  |null         |
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false  |null         |
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   ```
   
   #### DryRun => false
   ```bash
   scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => false)").show(false)
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   |orphan_file_location                                                                                                      |deleted|error_message|
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true   |null         |
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true   |null         |
   +--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
   ```
   
   #### DryRun => false, simulating failure during delete
   ```bash
   +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
   |orphan_file_location                                                                                                      |deleted|error_message                          |
   +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-615125af-bfbb-4279-b22b-559dee4f0b13-c000.snappy.parquet|true   |null                                   |
   |file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|false  |simulating failure during file deletion|
   +--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

