RussellSpitzer commented on issue #3496:
URL: https://github.com/apache/iceberg/issues/3496#issuecomment-963156998


   By default, the Action removes the files that are no longer live. The 
difference between the table API and the Action is that the Action determines 
the set of files to remove with a distributed job, while the table API does 
this calculation locally.
   
   The main reason for writing the Action was that the default API does not 
scale well for extremely large tables. To preserve the functionality of the 
original API, we added a flag that makes the original API remove only the 
snapshots, without deleting files. The Action then takes the difference in 
table state between before and after running the API and uses that 
information to delete the files.
   
   This is explained in the Javadoc for the class:
   
https://github.com/apache/iceberg/blob/9b285d049c094ca6ee717e159249dee36d118894/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L54-L67
   
   Here you can see that the delete function is applied to the resulting diff set:
   
   
https://github.com/apache/iceberg/blob/9b285d049c094ca6ee717e159249dee36d118894/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L214-L221
   
   The default delete function removes the files; see:
   
   
https://github.com/apache/iceberg/blob/9b285d049c094ca6ee717e159249dee36d118894/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L82-L89
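
   To make the flow above concrete, here is a minimal sketch of the "diff, then delete" idea in plain Java. It is not the Action's code: the class `ExpireDiffSketch`, its methods, and the file names are hypothetical, and plain in-memory sets stand in for the file listings that the real Action computes with a distributed job.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names come from Iceberg.
class ExpireDiffSketch {

    // Files referenced before expiration but not after it are no longer live.
    static Set<String> filesToDelete(Set<String> before, Set<String> after) {
        Set<String> diff = new LinkedHashSet<>(before);
        diff.removeAll(after);
        return diff;
    }

    // Apply the supplied delete function to every file in the diff set.
    static void deleteAll(Set<String> files, Consumer<String> deleteFunc) {
        files.forEach(deleteFunc);
    }

    public static void main(String[] args) {
        // Assumed state: the files reachable before and after snapshots expire.
        Set<String> before = new LinkedHashSet<>(List.of("a.parquet", "b.parquet", "c.parquet"));
        Set<String> after = new LinkedHashSet<>(List.of("b.parquet", "c.parquet"));

        // Stand-in delete function that only records what it was asked to remove.
        List<String> deleted = new ArrayList<>();
        deleteAll(filesToDelete(before, after), deleted::add);

        System.out.println(deleted); // prints [a.parquet]
    }
}
```

   Passing the delete step in as a `Consumer<String>` keeps the "what to delete" calculation separate from the "how to delete" mechanism.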
   
   ---
   
   So I do not believe any changes are needed here. If a user did want to use a 
separate delete facility, I would actually suggest they use
   
   
https://github.com/apache/iceberg/blob/9b285d049c094ca6ee717e159249dee36d118894/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java#L154
   
   which was added explicitly for users who have some sort of distributed async 
delete solution.
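
   As a rough illustration of what such a separate, asynchronous delete facility could look like, the sketch below hands a list of paths to a thread pool. It does not reproduce the linked method: `AsyncDeleteSketch`, `submitDeletes`, and the sample paths are all hypothetical, and a local thread pool stands in for whatever distributed delete service a user actually runs.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names come from Iceberg.
class AsyncDeleteSketch {

    // Submit one delete task per path and return a future that completes
    // once every task has finished.
    static CompletableFuture<Void> submitDeletes(
            List<String> paths, Consumer<String> deleter, ExecutorService pool) {
        CompletableFuture<?>[] tasks = paths.stream()
                .map(path -> CompletableFuture.runAsync(() -> deleter.accept(path), pool))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(tasks);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Assumed input: paths that an expire step reported as no longer live.
            List<String> expired = List.of("s3://bucket/a.parquet", "s3://bucket/b.parquet");

            // Thread-safe stand-in for a real delete call.
            Queue<String> deleted = new ConcurrentLinkedQueue<>();
            submitDeletes(expired, deleted::add, pool).join();

            System.out.println(deleted.size()); // prints 2
        } finally {
            pool.shutdown();
        }
    }
}
```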


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


