erkanonl commented on code in PR #14872:
URL: https://github.com/apache/iceberg/pull/14872#discussion_r2631988943


##########
core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java:
##########
@@ -240,6 +246,14 @@ public int succeededCommits() {
     return succeededCommits;
   }
 
+  public int failedCommits() {
+    return failedCommits;

Review Comment:
   > For instance, what value does surfacing the exceptions to the user add.
   > For partial progress scenarios where some commits succeed, some fail, the 
exceptions will almost always be CommitFailedException due to concurrent 
modifications when retries exhaust. What would a caller do differently based on 
having N copies of this exception vs just the count?
   
   In our logs, we saw that there could be exceptions due to different reasons. 
One particular exception that we can see as logged by BaseCommitService is that 
` X added_records > Y removed_records`. This doesn't look like a standard 
commit conflict exception (We have a pending AI to investigate this error 
separately). We want to differentiate these kinds of exceptions from regular 
CommitFailedExceptions which happen due to concurrent modifications.
   
   
   > For any other failures like permissions or auth, I'd expect all commits to 
fail. There's a rare case where permission failures could happen to a subset of 
my table, but that seems edge-case.
   
   Hmm, I'm not sure what would happen there but if we have the exception list, 
we can certainly tell it by analysing it, emitting metrics and logging.
   
   > Even with the exception objects, debugging still requires going to logs to 
understand which commits failed and why.
   
   Yes, it requires going to logs to understand which commits failed but we 
want to classify errors at different categories to investigate further. We 
currently only have CommitFailedException category but we saw that 
CommitFailedException could happen due to other reasons like I mentioned above.
   
   > if there are many failures, we could accumulate a lot of exception objects 
with full stack traces.
   
   Good point, but visibility is also required here and clients should not 
hundreds of thousands of commit failed exceptions in BaseCommitService. If they 
have, they should fix the underlying root cause.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to