aokolnychyi commented on a change in pull request #2328:
URL: https://github.com/apache/iceberg/pull/2328#discussion_r595461100



##########
File path: core/src/main/java/org/apache/iceberg/SnapshotProducer.java
##########
@@ -293,8 +293,15 @@ public void commit() {
             taskOps.commit(base, updated.withUUID());
           });
 
+    } catch (CommitFailedException commitFailedException) {
+      // We have an acknowledged failure from the Catalog. We are confident 
that the commit has not been applied
+      Exceptions.suppressAndThrow(commitFailedException, this::cleanAll);
     } catch (RuntimeException e) {
-      Exceptions.suppressAndThrow(e, this::cleanAll);
+      LOG.error("Cannot determine whether the commit was successful or not, 
the underlying data files may or " +

Review comment:
       I thought more about this today and I am still convinced deleting files 
in an attempt to fail other concurrent operations (which is not guaranteed) and 
requiring a manual intervention to restore the table is NOT something we should 
do due to the described consequences and cases that we won't handle.
   
   I also like throwing an exception rather than hanging forever as this seems 
to be more descriptive. I think the error message Russell throws right now is 
pretty elaborate and it should be propagated as a cause of `SparkException`. As 
a user, I'd prefer my job to fail with `UnknownCommitStateException` rather 
than with a job timeout exception and no signs of what went wrong.
   
   Internally, we offer job retry mechanisms but we explicitly warn our users 
that they may enable it only if their jobs are idempotent. We can try to reduce 
the chances of retries but I don't think we can solve that completely when we 
loose the metastore connectivity. I think it is reasonable to explore how we 
can skip already committed batch operations like we do in Structured Streaming 
but that can be done in a follow-up PR.
   
   It may be also a good call to check the commit state more than once for some 
reasonable time frame but it shouldn't be for too long to avoid job timeouts.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to