mehtaashish23 commented on a change in pull request #802: Update 
RemoveSnapshots to protect cherry-picked data files
URL: https://github.com/apache/incubator-iceberg/pull/802#discussion_r381139425
 
 

 ##########
 File path: core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
 ##########
 @@ -188,16 +189,33 @@ private void cleanExpiredFiles(List<Snapshot> snapshots, Set<Long> validIds, Set
     // physically deleting files that were logically deleted in a commit that was rolled back.
     Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
 
 Review comment:
   @rdblue I can see that this PR handles almost all scenarios in the current
(first) run, where some snapshots are being expired. I am trying to understand
the run that follows: what does SnapshotUtil.ancestorIds return on subsequent
runs, after one of the ancestor IDs has been expired in the current run?
   
   As far as I can tell, it returns the ancestors of the current snapshot,
walking back for as long as each parent snapshot can still be found in the
snapshot list. Since a particular ancestor was expired in a previous run, the
list will be incomplete. Right? If so, I suppose we can't support APIs like
expireSnapshotId for selectively expiring snapshots, because that breaks the
lineage.
   Thinking it over, it seems we can't support ID-based expiry and can only do
time-based expiry. That would avoid the issue where (see the example below) I
expire C (cherry-picked from B) and later decide to expire B, which might
delete B's underlying data files that are referenced by C, which was expired in
the previous run. How does this PR handle that?
   //A -- C (B) -- D
   //   `- B
   
   Correct me if I am wrong; I am still trying to understand the nuances of
snapshot list maintenance in Iceberg.
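
To make the question concrete, here is a minimal sketch of the lineage walk as I understand it. It is not Iceberg's code: `LineageSketch` and `Snap` are hypothetical names, the snapshot IDs 1 to 4 stand in for A, B, C, and D from the example above, and I am assuming the walk simply stops once a parent can no longer be looked up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LineageSketch {
  /** Hypothetical stand-in for a snapshot: an id plus an optional parent id. */
  static final class Snap {
    final long id;
    final Long parentId; // null for the root snapshot

    Snap(long id, Long parentId) {
      this.id = id;
      this.parentId = parentId;
    }
  }

  /**
   * Walks parent pointers starting at currentId and stops as soon as a
   * snapshot can no longer be found among the retained snapshots.
   * (Assumed behavior, modeled on the description in this comment.)
   */
  static List<Long> ancestorIds(Long currentId, Map<Long, Snap> retained) {
    List<Long> ids = new ArrayList<>();
    Long next = currentId;
    while (next != null && retained.containsKey(next)) {
      Snap snap = retained.get(next);
      ids.add(snap.id);
      next = snap.parentId;
    }
    return ids;
  }

  public static void main(String[] args) {
    // A(1) -- C(3) -- D(4), with B(2) branching off A; D is current.
    Map<Long, Snap> retained = new HashMap<>();
    retained.put(1L, new Snap(1L, null));
    retained.put(2L, new Snap(2L, 1L));
    retained.put(3L, new Snap(3L, 1L));
    retained.put(4L, new Snap(4L, 3L));

    System.out.println(ancestorIds(4L, retained)); // [4, 3, 1]

    // Expire C by id: the walk from D can no longer reach A.
    retained.remove(3L);
    System.out.println(ancestorIds(4L, retained)); // [4]
  }
}
```

In this model, once C is expired by ID, a later run sees only D as an ancestor of the current snapshot, which is the incomplete lineage I am asking about.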

