RussellSpitzer commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876111064


   ExpireSnapshotActions will always only remove files that were previously 
part of the Iceberg Table.
   RemoveOrphanFiles will remove those files and any other files that are not 
explicitly part of the table.
   
   Let's take an example
   
   I have a Directory /myTable/
   
   I have performed several commits
   
   ```
   1. Add File A, B. - Table is A, B
   2. Remove File A, Add File C. - Table is B, C
   3. Remove File C, Add File D - Table is B, D
   ```
   
   But let's say we had a bit of an error and also wrote BrokenFile in a run 
that got canceled. The file was created, but it was never added to the table.
   
   The directory now has five files from Iceberg (and metadata files)
   
   ```
   A, B, C, D and BrokenFile
   ```
   
   If I run expire snapshots and remove snapshot 1, it checks for all files 
that were referred to by snapshot 1 (A, B) but were not referred to by 
Snapshot 2 (B, C) or Snapshot 3 (B, D). 
   This is a single file:
   
   ```
   (A, B) but not (B, C, D) = (A). 
   ```
   
   So only File A should be deleted (along with the metadata that only 
Snapshot 1 referenced).
   
   If I run expire snapshots and remove snapshots 1 and 2, we do a similar thing.
   
   ```
   (A, B, C) but not (B, D) = (A, C)
   ```
   
   So in this case expire snapshots removes two data files, A and C.
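
   The two expire cases above are just set differences. Here is a minimal 
sketch of that reasoning in plain Python. It is not Iceberg's implementation; 
the function name and the dict-of-sets model of a table are assumptions made 
up for illustration.

   ```python
   def files_removed_by_expire(snapshots, expired_ids):
       """Return files referenced only by the snapshots being expired.

       snapshots: dict of snapshot id -> set of files live in that snapshot
       expired_ids: ids of the snapshots to expire
       (Hypothetical model of a table, not an Iceberg API.)
       """
       expired = set().union(*(snapshots[i] for i in expired_ids))
       retained = set().union(
           *(files for i, files in snapshots.items() if i not in expired_ids)
       )
       # Only files that no retained snapshot still references may be deleted.
       return expired - retained


   # Table contents after each of the three commits in the example.
   snapshots = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"B", "D"}}

   expire_first = files_removed_by_expire(snapshots, {1})
   expire_first_two = files_removed_by_expire(snapshots, {1, 2})
   ```

   Note that BrokenFile never appears in either result, because it was never 
in any snapshot to begin with.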
   
   Neither of these operations was able to remove "BrokenFile". It was never 
listed in any snapshot, so it can never be picked up by this operation. Let's 
say that we expired Snapshots 2 and 1, but the delete operation failed so the 
files were never removed.
   
   Now the table just looks like
   
   ```
   3. Remove C, Add D - Table is B, D
   ```
   
   But our directory still has A, B, C, D and BrokenFile
   
   Remove Orphan Files can clean this up for us because it does not use the 
removed snapshots to determine which files to remove. Instead it lists all the 
files in the table location (A, B, C, D, BrokenFile) and then deletes all 
files which are not referenced by the table. Currently the table only has 1 
snapshot, snapshot 3 (B, D).
   ```
   (A, B, C, D, BrokenFile) but not (B, D) = (A, C, BrokenFile)
   ```
   So RemoveOrphanFiles will remove 3 files: A, C and BrokenFile
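
   The same calculation as a small Python sketch. Again this only models the 
idea (directory listing minus everything the table still references); the 
function name and data shapes are hypothetical, not Iceberg's API.

   ```python
   def orphan_files(directory_listing, snapshots):
       """Return files in the listing that no current snapshot references.

       directory_listing: iterable of file names found in the table location
       snapshots: dict of snapshot id -> set of files live in that snapshot
       (Hypothetical model of a table, not an Iceberg API.)
       """
       referenced = set().union(*snapshots.values())
       return set(directory_listing) - referenced


   # Snapshots 1 and 2 were expired, but their file deletes failed.
   listing = ["A", "B", "C", "D", "BrokenFile"]
   remaining_snapshots = {3: {"B", "D"}}

   orphans = orphan_files(listing, remaining_snapshots)
   ```

   Unlike the expire case, the expired snapshots play no part here, which is 
why BrokenFile is picked up as well.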
   
   
   So this is why I consider Remove Orphan Files to be a superset of what 
ExpireSnapshots removes. I guess it is more correct to say RemoveOrphanFiles 
will remove all files that should have been removed after ExpireSnapshots, even 
if ExpireSnapshots fails to delete those files for some reason. 
   
   Expire snapshots removes the history and then all the files which were only 
reachable by that history. Remove orphan files looks at all of the current 
history and compares it to a raw directory listing. Remove orphan files can 
clean up after a failed ExpireSnapshots, but not the reverse.

   Remove orphan files is more dangerous because there is a possibility that 
your table location has files from other projects, or that the paths have 
changed in some subtle way so that they still resolve to the same file but no 
longer string match. For example, if you store paths without an authority and 
then change authorities, you may have paths which point to the same file but 
do not string match correctly.
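
   To make the string-matching risk concrete, here is a small Python sketch. 
The namenode hostnames and file path are made up, and comparing on the path 
component alone is only an illustration of the mismatch, not how Iceberg 
compares paths.

   ```python
   from urllib.parse import urlsplit


   def orphans_by_string_match(listing, referenced):
       """Flag every listed path that is not string-equal to a referenced path."""
       referenced_set = set(referenced)
       return [p for p in listing if p not in referenced_set]


   def orphans_by_path_only(listing, referenced):
       """Flag listed paths after dropping scheme and authority from both sides."""
       referenced_paths = {urlsplit(p).path for p in referenced}
       return [p for p in listing if urlsplit(p).path not in referenced_paths]


   # Hypothetical: metadata recorded the old authority, the listing uses the new one.
   referenced = ["hdfs://old-namenode:8020/myTable/data/B.parquet"]
   listing = ["hdfs://new-namenode:8020/myTable/data/B.parquet"]

   naive = orphans_by_string_match(listing, referenced)
   normalized = orphans_by_path_only(listing, referenced)
   ```

   With plain string matching, file B is flagged as an orphan even though the 
table still needs it. With the authority stripped, nothing is flagged.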
   
   -----
   
   TL;DR
   
   Expire Snapshots will *never* remove a file that Iceberg will need to read 
the table, with a small caveat for tables which are created as snapshots of 
other tables. See GC_ENABLED in table properties.
   
   RemoveOrphanFiles *should never* remove a file that Iceberg will need to 
read the table, but since it uses string matching to determine which files to 
remove, there is a chance it can remove necessary files, which is why a dry-run 
flag was introduced. It will also remove any other files in the table location, 
even if they were never part of the Iceberg table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


