amogh-jahagirdar commented on code in PR #4578:
URL: https://github.com/apache/iceberg/pull/4578#discussion_r886339778


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -161,21 +176,143 @@ public List<Snapshot> apply() {
 
   private TableMetadata internalApply() {
     this.base = ops.refresh();
+    if (base.snapshots().isEmpty()) {
+      return base;
+    }
 
     Set<Long> idsToRetain = Sets.newHashSet();
-    List<Long> ancestorIds = SnapshotUtil.ancestorIds(base.currentSnapshot(), 
base::snapshot);
-    if (minNumSnapshots >= ancestorIds.size()) {
-      idsToRetain.addAll(ancestorIds);
-    } else {
-      idsToRetain.addAll(ancestorIds.subList(0, minNumSnapshots));
+
+    // Compute the snapshots for each reference
+    Map<SnapshotRef, Set<Long>> refSnapshots = 
computeRefSnapshots(base.refs().values());
+
+    // Identify unreferenced snapshots which should be retained
+    Set<Long> unreferencedSnapshotsToRetain = 
computeUnreferencedSnapshotsToRetain(refSnapshots);

Review Comment:
   Yeah I'm still thinking if there's a way to simplify this. 
   
   I think we'll need the structure for maintaining what are the referenced 
snapshots ("referenced" meaning for a branch all it's ancestors, and for a tag, 
the snapshot it refers to). I don't think we can construct the snapshot 
retention set and then after that look through all the table's snapshots , and 
correctly derive which snapshots are the unreferenced snapshots which should 
still be retained based on expiration age.
   
   The reason is some of the table's snapshots may no longer need to be 
retained due to branch retention policy but if we go through those snapshots 
and they're still within defaultExpirationAge we'd retain them which I don't 
think we want since the branch has it's own lifecycle.  I think if a snapshot 
was part of a branch which expired due to it's own retention policy then we 
don't want to retain it on the basis of the defaultExpireOlderThan.   
   
   **In other words, in this approach I don't think we can differentiate 
between a snapshot which should actually be removed based on branch retention 
or if it's a snapshot which was never part of any reference.**
   
   I think we need to be able to identify what are the snapshots which are not 
part of any branch lineage or are tagged and for that we need a set of all the 
snapshots which encapsulate all the current branch lineages. We do the grouping 
so that it can be reused when determining which branch snapshots to keep (so 
that the ancestors don't have to be iterated through again). Let me know what 
you think @rdblue  !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to