RussellSpitzer commented on issue #2793: URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876111064
ExpireSnapshots will only ever remove files that were previously part of the Iceberg table. RemoveOrphanFiles will remove those files and any other files that are not explicitly part of the table.

Let's take an example. I have a directory /myTable/ and I have performed several commits:

```
1. Add File A, B. - Table is A, B
2. Remove File A, Add File C. - Table is B, C
3. Remove File C, Add File D - Table is B, D
```

But let's say we had a bit of an error and we also wrote BrokenFile in a run that got canceled. It was never added to the table, but the file was created. The directory now has five data files (plus Iceberg's metadata files):

```
A, B, C, D and BrokenFile
```

If I run expire snapshots and remove Snapshot 1, it checks for all files that were referred to by Snapshot 1 (A, B) and that are not referred to by Snapshot 2 (B, C) or Snapshot 3 (B, D). This is a single file:

```
(A, B) but not (B, C, D) = (A)
```

So only File A should be deleted (along with the metadata for Snapshot 1).

If I run expire snapshots and remove Snapshots 2 and 1, we do a similar thing:

```
(A, B, C) but not (B, D) = (A, C)
```

So in this case expire snapshots removes two data files, A and C.

Neither of these operations was able to remove BrokenFile. It was never listed in any snapshot, so it can never be picked up by this operation.

Let's say that we expired Snapshots 2 and 1, but the delete operation failed, so the files were never removed. Now the table just looks like:

```
3. Remove C, Add D - Table is B, D
```

But our directory still has A, B, C, D and BrokenFile.

Remove Orphan Files can clean this up for us because it does not use the removed snapshots to determine which files to remove. Instead it lists all the files in the table location (A, B, C, D, BrokenFile) and then deletes all files that are not referenced by the table. Currently the table has only one snapshot, Snapshot 3 (B, D).
```
(A, B, C, D, BrokenFile) but not (B, D) = (A, C, BrokenFile)
```

So RemoveOrphanFiles will remove three files: A, C and BrokenFile.

This is why I consider Remove Orphan Files to be a superset of what ExpireSnapshots removes. I guess it is more correct to say RemoveOrphanFiles will remove all files that should have been removed after ExpireSnapshots, even if ExpireSnapshots fails to delete those files for some reason.

Expire snapshots removes the history and then all the files which were only reachable by that history. RemoveOrphanFiles looks at all of the current history and compares it to a raw directory listing. Remove orphan files can clean up after a failed ExpireSnapshots, but not the reverse.

Remove orphan files is more dangerous because there is a possibility that your table location has files from other projects, or that the paths have changed in some subtle way so that they match on resolution but do not string match. For example, if you store paths without an authority and then change authorities, you may have files which are the same but do not string match correctly.

-----

TL;DR: Expire Snapshots will *never* remove a file that Iceberg needs in order to read the table, with a small caveat for tables which are created as snapshots of other tables. See GC_ENABLED in table properties.

RemoveOrphanFiles *should never* remove a file that Iceberg needs in order to read the table, but since it uses string matching to determine which files to remove, there is a chance it can remove necessary files, which is why a dry-run flag was introduced. It will also remove any other files in the table location, even if they were never part of the Iceberg table.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
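
For anyone who wants to play with the example in the comment, here is a minimal sketch of the same arithmetic using plain Python sets. It is a toy model, not the Iceberg API: the helper names `expire_snapshots` and `remove_orphan_files`, the `.parquet` paths, and the `nn1`/`nn2` authorities are all made up for illustration.

```python
# Toy model of the A/B/C/D example; plain Python sets, not the Iceberg API.

# Files referenced by each snapshot in the example.
snapshots = {
    1: {"A", "B"},
    2: {"B", "C"},
    3: {"B", "D"},
}

# Raw listing of the table location, including the file from the canceled run.
directory = {"A", "B", "C", "D", "BrokenFile"}


def expire_snapshots(snapshots, expired_ids):
    """Files reachable only from the expired snapshots (hypothetical helper)."""
    expired = set().union(*(snapshots[i] for i in expired_ids))
    retained = set().union(
        *(files for i, files in snapshots.items() if i not in expired_ids)
    )
    return expired - retained


def remove_orphan_files(snapshots, directory):
    """Listed files not referenced by any remaining snapshot (hypothetical helper)."""
    referenced = set().union(*snapshots.values())
    return directory - referenced


print(sorted(expire_snapshots(snapshots, {1})))     # ['A']
print(sorted(expire_snapshots(snapshots, {1, 2})))  # ['A', 'C']

# After Snapshots 1 and 2 are expired, only Snapshot 3 is left in the table.
remaining = {3: snapshots[3]}
print(sorted(remove_orphan_files(remaining, directory)))  # ['A', 'BrokenFile', 'C']

# The string-matching hazard: the same file under a changed authority
# (made-up hosts nn1 and nn2) no longer matches and so looks orphaned.
referenced_paths = {"hdfs://nn1/myTable/data/B.parquet"}
listed_paths = {"hdfs://nn2/myTable/data/B.parquet"}
print(listed_paths - referenced_paths)
```

BrokenFile never shows up in the `expire_snapshots` results because that helper only looks at files some snapshot referenced. `remove_orphan_files` starts from the directory listing instead, which is both why it catches BrokenFile and why a path that merely fails to string match gets flagged.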
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
