amogh-jahagirdar commented on code in PR #4578:
URL: https://github.com/apache/iceberg/pull/4578#discussion_r863946478
##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -195,6 +299,18 @@ public void commit() {
LOG.info("Committed snapshot changes");
if (cleanExpiredFiles) {
+ TableMetadata updated = ops.refresh();
+ if (updated.refs() != null) {
+ List<SnapshotRef> branches = updated.refs()
+ .values().stream().filter(SnapshotRef::isBranch)
+ .collect(Collectors.toList());
+
+ if (branches.size() > 1) {
+ throw new UnsupportedOperationException(
+ "Deleting expired files when there is more than 1 branch is
currently not supported");
+ }
Review Comment:
@rdblue Thanks for the explanation on why the reachability analysis needs to
be done and why without that we can't delete files if there are multiple
branches.
> However, one thing does need to be fixed: this check should happen before
committing the changes. If we detect that we can't reliably clean up the table,
we shouldn't drop the snapshots anyway. So this check should come first in the
commit.
I was thinking about this before, I'm not sure if the operation to remove
snapshots should be coupled with the ability to clean up the files. We know
reliably which snapshots which should be removed based on the updated algorithm
so it seems a bit unnecessarily constraining to fail the entire operation in
case we can't do the deletion of files.
Currently, if I'm not mistaken, deletion of the files themselves is already
best effort and can fail, so it's already not a guarantee that we can reliably
clean up the table. So following this, I think it makes sense to allow the
removal of snapshot metadata and then prevent the deletion of files in the case
of multiple branches. Right now we're preventing the operation by throwing an
exception which as @kbendick pointed out, to a user would seem like the
operation failed but it didn't from a snapshot removal perspective. Another way
to handle this is to ignore the cleanFiles flag in case there are multiple
branches and warn log that nothing is being deleted because of the current
constraint (instead of throwing an exception). But that also does not seem
clean since it's not particularly explicit.
@kbendick @rdblue Let me know your thoughts!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]