[
https://issues.apache.org/jira/browse/KAFKA-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764440#comment-17764440
]
Justine Olshan commented on KAFKA-15459:
----------------------------------------
Hey [~tombentley] thanks for taking a look.
The difficulty here is that we want to keep supporting old clients.
Adding errors safely requires a produce request version bump, which I was
hoping to avoid in part 1 because part 1 is aimed at older clients. We are
stuck between returning specific errors that Java clients handle correctly
(i.e. they retry) and having those same specific errors treated as fatal by
other clients in cases that should not be fatal. We opted for a middle ground:
a non-specific error code, plus a message in the response that names the
actual cause.
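As a rough sketch of that middle ground, the conversion could look something
like the following. This is not the actual broker code: the class, the
PartitionError holder, the method name, and the message wording are all
hypothetical, and the Error enum is a stand-in for the handful of constants
from the issue description below rather than Kafka's real Errors class.

```java
import java.util.EnumSet;
import java.util.Set;

final class ProduceErrorConversion {

    // Stand-in for a few constants of Kafka's Errors enum (not the real class).
    enum Error {
        NOT_ENOUGH_REPLICAS,
        COORDINATOR_LOAD_IN_PROGRESS,
        COORDINATOR_NOT_AVAILABLE,
        CONCURRENT_TRANSACTIONS,
        INVALID_TXN_STATE,
        INVALID_PRODUCER_ID_MAPPING
    }

    // The errors the issue description says should be retriable.
    static final Set<Error> RETRIABLE_COORDINATOR_ERRORS = EnumSet.of(
        Error.COORDINATOR_LOAD_IN_PROGRESS,
        Error.COORDINATOR_NOT_AVAILABLE,
        Error.CONCURRENT_TRANSACTIONS);

    // Hypothetical holder for the error code and message sent in the response.
    static final class PartitionError {
        final Error error;
        final String message; // null when there is nothing extra to report

        PartitionError(Error error, String message) {
            this.error = error;
            this.message = message;
        }
    }

    // Retriable coordinator errors are replaced by NOT_ENOUGH_REPLICAS, which
    // older clients already retry; the real cause is kept in the message.
    // Every other error is passed through unchanged.
    static PartitionError toProduceResponseError(Error cause) {
        if (RETRIABLE_COORDINATOR_ERRORS.contains(cause)) {
            // Illustrative message text only, not the broker's actual wording.
            return new PartitionError(Error.NOT_ENOUGH_REPLICAS,
                "Retriable coordinator error: " + cause);
        }
        return new PartitionError(cause, null);
    }
}
```

With this shape, an old client only ever sees an error code it already knows
how to retry, while anyone debugging can still read the original cause from
the message.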
I agree that newer clients should be able to support specific errors. In
part 2 we can improve the error story considerably, including a single error
code for all retriable errors.
I've had conversations with [~hachikuji] and [~alivshits] about simplifying
into retriable, abortable, and fatal errors. I hope to tackle this in part 2
and I can make a Jira for that as well.
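For the five errors listed in the issue description below, that three-way
split might look roughly like this. It is only a sketch of the idea, not a
part 2 design: the class and method names are hypothetical, errors are
identified by name to keep the example free of Kafka dependencies, and
treating every unlisted error as fatal is an assumption made purely for
illustration.

```java
final class ErrorClassification {

    // The three categories discussed for part 2.
    enum Handling { RETRIABLE, ABORTABLE, FATAL }

    // Maps an error name to how a client would be expected to react.
    static Handling classify(String errorName) {
        switch (errorName) {
            case "COORDINATOR_LOAD_IN_PROGRESS":
            case "COORDINATOR_NOT_AVAILABLE":
            case "CONCURRENT_TRANSACTIONS":
                // Transient coordinator conditions: retry the request.
                return Handling.RETRIABLE;
            case "INVALID_TXN_STATE":
            case "INVALID_PRODUCER_ID_MAPPING":
                // The transaction cannot continue: abort it.
                return Handling.ABORTABLE;
            default:
                // Assumption for this sketch only: anything else is fatal.
                return Handling.FATAL;
        }
    }
}
```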
Let me know if there is any other information that would help you understand
the decision. I will also update the KIP to explain this.
> Convert coordinator retriable errors to a known producer response error.
> ------------------------------------------------------------------------
>
> Key: KAFKA-15459
> URL: https://issues.apache.org/jira/browse/KAFKA-15459
> Project: Kafka
> Issue Type: Sub-task
> Affects Versions: 3.6.0
> Reporter: Justine Olshan
> Assignee: Justine Olshan
> Priority: Blocker
> Fix For: 3.6.0
>
>
> While reviewing [https://github.com/apache/kafka/pull/14370] I added some of
> the documentation for the returned errors in the produce response as well.
> There were concerns about the new errors:
> * {@link Errors#COORDINATOR_LOAD_IN_PROGRESS}
> * {@link Errors#COORDINATOR_NOT_AVAILABLE}
> * {@link Errors#INVALID_TXN_STATE}
> * {@link Errors#INVALID_PRODUCER_ID_MAPPING}
> * {@link Errors#CONCURRENT_TRANSACTIONS}
> The coordinator load, not available, and concurrent transactions errors
> should be retriable.
> The invalid txn state and pid mapping errors should be abortable.
> This is how older Java clients handle the errors, but it is unclear how other
> clients handle them. It seems that rdkafka (for example) treats the abortable
> errors as fatal instead. There, the coordinator errors are retriable, but the
> concurrent transactions error is not.
> It seems acceptable for the abortable errors to be fatal on some clients,
> since the error likely occurs on a zombie producer or in a state that may be
> hard to recover from. For the retriable errors, however, we can return
> NOT_ENOUGH_REPLICAS, which is a known retriable error. We can use the
> produce API's response string to specify the real cause of the error for
> debugging.
> There were trade-offs between keeping older clients working and clarity
> in errors. This seems to be the best compromise.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)