[jira] [Commented] (KAFKA-10485) Use a separate error code for replication related errors

Chia-Ping Tsai (Jira) Wed, 16 Sep 2020 21:17:57 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197368#comment-17197368
 ]


Chia-Ping Tsai commented on KAFKA-10485:
----------------------------------------

REQUEST_TIMED_OUT is treated as a fatal error by TransactionManager (except in 
TxnOffsetCommitHandler). Would returning REQUEST_TIMED_OUT to the client cause 
compatibility problems?

> Use a separate error code for replication related errors
> --------------------------------------------------------
>
>                 Key: KAFKA-10485
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Guozhang Wang
>            Priority: Major
>
> Today, when a coordinator request involves an append to the internal topic, 
> e.g. a commit / sync-group request sent to the group coordinator, we 
> capture the following errors and translate them to COORDINATOR_NOT_AVAILABLE 
> before returning to the client:
> * UNKNOWN_TOPIC_OR_PARTITION
> * NOT_ENOUGH_REPLICAS
> * NOT_ENOUGH_REPLICAS_AFTER_APPEND
> * REQUEST_TIMED_OUT (for txn coordinator)
> Among those, the second and third cases are worth reconsidering, because a 
> COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator 
> unnecessarily with a short backoff time. The fourth case is probably also 
> worth revisiting: although the motivation for using 
> COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs 
> unnecessary coordinator re-discovery.
> It would be better if, for 2)/3), clients did not re-discover the 
> coordinator but just retried with a longer backoff time, and at the same 
> time exposed this, through either a metric or warning logs, to indicate 
> that some other brokers, not the coordinator, are unavailable and are 
> causing this operation to be blocked. For 4), clients can just retry without 
> re-discovery. Only for 1) does it make sense to let clients re-discover the 
> coordinator.
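
To make the quoted proposal easier to compare against today's behavior, here 
is a minimal sketch of the two mappings from append error to client reaction. 
It is not Kafka client code: the ClientAction names, the TODAY / PROPOSED 
tables, and the helper function are hypothetical and exist only for 
illustration. The issue does not say what the new error code would be called, 
so the sketch models only the client-side reaction.

```python
from enum import Enum, auto


class ClientAction(Enum):
    """Hypothetical client reactions; these names are not part of Kafka."""
    REDISCOVER_COORDINATOR = auto()  # find the coordinator again, short backoff
    RETRY_LONG_BACKOFF = auto()      # keep the coordinator, retry after a longer backoff
    RETRY = auto()                   # keep the coordinator, plain retry


# Today: every listed append error is translated to COORDINATOR_NOT_AVAILABLE,
# so the client always re-discovers the coordinator.
# REQUEST_TIMED_OUT applies to the txn coordinator only.
TODAY = {
    "UNKNOWN_TOPIC_OR_PARTITION": ClientAction.REDISCOVER_COORDINATOR,
    "NOT_ENOUGH_REPLICAS": ClientAction.REDISCOVER_COORDINATOR,
    "NOT_ENOUGH_REPLICAS_AFTER_APPEND": ClientAction.REDISCOVER_COORDINATOR,
    "REQUEST_TIMED_OUT": ClientAction.REDISCOVER_COORDINATOR,
}

# Proposed in the issue: only case 1) still triggers re-discovery.
PROPOSED = {
    "UNKNOWN_TOPIC_OR_PARTITION": ClientAction.REDISCOVER_COORDINATOR,
    "NOT_ENOUGH_REPLICAS": ClientAction.RETRY_LONG_BACKOFF,
    "NOT_ENOUGH_REPLICAS_AFTER_APPEND": ClientAction.RETRY_LONG_BACKOFF,
    "REQUEST_TIMED_OUT": ClientAction.RETRY,
}


def unnecessary_rediscoveries():
    """Errors that cause re-discovery today but would not under the proposal."""
    return [
        error
        for error, action in TODAY.items()
        if action is ClientAction.REDISCOVER_COORDINATOR
        and PROPOSED[error] is not ClientAction.REDISCOVER_COORDINATOR
    ]


if __name__ == "__main__":
    for error in unnecessary_rediscoveries():
        print(error, "->", PROPOSED[error].name)
```

Whether existing clients would tolerate a changed error code for case 4) is 
exactly the compatibility question raised above; the sketch does not answer it.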



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
