amogh-jahagirdar commented on code in PR #4578:
URL: https://github.com/apache/iceberg/pull/4578#discussion_r863946478


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -195,6 +299,18 @@ public void commit() {
     LOG.info("Committed snapshot changes");
 
     if (cleanExpiredFiles) {
+      TableMetadata updated = ops.refresh();
+      if (updated.refs() != null) {
+        List<SnapshotRef> branches = updated.refs()
+            .values().stream().filter(SnapshotRef::isBranch)
+            .collect(Collectors.toList());
+
+        if (branches.size() > 1) {
+          throw new UnsupportedOperationException(
+              "Deleting expired files when there is more than 1 branch is 
currently not supported");
+        }

Review Comment:
   @rdblue Thanks for the explanation on why the reachability analysis needs to 
be done and why without that we can't delete files if there are multiple 
branches.
   
   > However, one thing does need to be fixed: this check should happen before 
committing the changes. If we detect that we can't reliably clean up the table, 
we shouldn't drop the snapshots anyway. So this check should come first in the 
commit. 
   
   I was thinking about this before, I'm not sure if the operation to remove 
snapshots should be coupled with the ability to clean up the files. We know 
reliably which snapshots which should be removed based on the updated algorithm 
so it seems a bit unnecessarily constraining to fail the entire operation in 
case we can't do the deletion of files.
   
   Currently, if I'm not mistaken, deletion of the files themselves is already 
best effort and can fail, so it's already not a guarantee that we can reliably 
clean up the table. So following this, I think it makes sense to allow the 
removal of snapshot metadata and then prevent the deletion of files in the case 
of multiple branches. Right now we're preventing the operation by throwing an 
exception which as @kbendick pointed out, to a user would seem like the 
operation failed but it didn't from a snapshot removal perspective. Another way 
to handle this is to ignore the cleanFiles flag in case there are multiple 
branches and warn log that nothing is being deleted because of the current 
constraint (instead of throwing an exception). But that also does not seem 
clean since it's not particularly explicit. @kbendick @rdblue Let me know your 
thoughts! 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to