erkanonl commented on code in PR #14872:
URL: https://github.com/apache/iceberg/pull/14872#discussion_r2631988943
##########
core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java:
##########
@@ -240,6 +246,14 @@ public int succeededCommits() {
return succeededCommits;
}
+ public int failedCommits() {
+ return failedCommits;
Review Comment:
> For instance, what value does surfacing the exceptions to the user add.
> For partial progress scenarios where some commits succeed, some fail, the
exceptions will almost always be CommitFailedException due to concurrent
modifications when retries exhaust. What would a caller do differently based on
having N copies of this exception vs just the count?
In our logs, we saw that there could be exceptions due to different reasons.
One particular exception that we can see as logged by BaseCommitService is that
` X added_records > Y removed_records`. This doesn't look like a standard
commit conflict exception (We have a pending AI to investigate this error
separately). We want to differentiate these kinds of exceptions from regular
CommitFailedExceptions which happen due to concurrent modifications.
> For any other failures like permissions or auth, I'd expect all commits to
fail. There's a rare case where permission failures could happen to a subset of
my table, but that seems edge-case.
Hmm, I'm not sure what would happen there but if we have the exception list,
we can certainly tell it by analysing it, emitting metrics and logging.
> Even with the exception objects, debugging still requires going to logs to
understand which commits failed and why.
Yes, it requires going to logs to understand which commits failed but we
want to classify errors at different categories to investigate further. We
currently only have CommitFailedException category but we saw that
CommitFailedException could happen due to other reasons like I mentioned above.
> if there are many failures, we could accumulate a lot of exception objects
with full stack traces.
Good point, but visibility is also required here and clients should not
hundreds of thousands of commit failed exceptions in BaseCommitService. If they
have, they should fix the underlying root cause.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]