Guozhang Wang created KAFKA-10485:
-------------------------------------

             Summary: Use a separate error code for replication related errors
                 Key: KAFKA-10485
                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
             Project: Kafka
          Issue Type: Improvement
            Reporter: Guozhang Wang


Today, when a coordinator request involves an append to the internal topic, e.g.
a commit / sync-group request sent to the group coordinator, we capture the
following errors and translate them into COORDINATOR_NOT_AVAILABLE before
returning to the client:

* UNKNOWN_TOPIC_OR_PARTITION
* NOT_ENOUGH_REPLICAS
* NOT_ENOUGH_REPLICAS_AFTER_APPEND
* REQUEST_TIMED_OUT (for txn coordinator)
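
As a rough illustration of the translation described above (a hypothetical
Python sketch, not the broker's actual code; the function name is made up, and
passing other errors through unchanged is a simplifying assumption), four
distinct causes collapse into a single client-visible error code:

```python
# Hypothetical sketch only: the strings mirror the Kafka error names listed
# above, but this is not the broker implementation.
TRANSLATED_TO_COORDINATOR_NOT_AVAILABLE = {
    "UNKNOWN_TOPIC_OR_PARTITION",
    "NOT_ENOUGH_REPLICAS",
    "NOT_ENOUGH_REPLICAS_AFTER_APPEND",
    "REQUEST_TIMED_OUT",  # txn coordinator only; this sketch ignores coordinator type
}


def translate_append_error(append_error: str) -> str:
    """Return the error code sent to the client after a failed internal-topic append."""
    if append_error in TRANSLATED_TO_COORDINATOR_NOT_AVAILABLE:
        return "COORDINATOR_NOT_AVAILABLE"
    # Assumption for illustration: any other error is returned as-is.
    return append_error
```

Because the client only sees COORDINATOR_NOT_AVAILABLE, it cannot tell a
replication problem from a genuinely unavailable coordinator.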

Among those, the second and third cases are worth reconsidering, because a
COORDINATOR_NOT_AVAILABLE would cause clients to re-discover the coordinator
unnecessarily with a short backoff time. The fourth case is probably also worth
revisiting: although the motivation for using COORDINATOR_NOT_AVAILABLE is to
let the client retry, it still incurs an unnecessary coordinator re-discovery.

It would be better if, for 2)/3), clients did not re-discover the coordinator
but just retried with a longer backoff time, and at the same time exposed this
through either a metric or warning logs indicating that some other brokers, not
the coordinator, are unavailable and are causing this operation to be blocked.
For 4), clients can just retry without re-discovery. Only for 1) does it make
sense to let the clients re-discover the coordinator.
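
The proposed client behavior could be sketched roughly as follows. This is a
hypothetical Python illustration, not Kafka client code: `RetryPlan`,
`plan_retry`, and both backoff values are assumptions, and since this issue does
not yet name the new error code, the sketch keys on the underlying cause. The
backoff for case 4) is also assumed, as the proposal only says to retry without
re-discovery.

```python
from dataclasses import dataclass

# Assumed values for illustration; the issue does not specify concrete backoffs.
SHORT_BACKOFF_MS = 100
LONG_BACKOFF_MS = 1000


@dataclass(frozen=True)
class RetryPlan:
    rediscover_coordinator: bool  # send a new find-coordinator request first?
    backoff_ms: int               # how long to wait before retrying
    warn: bool                    # surface via a metric or warning log?


def plan_retry(cause: str) -> RetryPlan:
    """Map the underlying append failure to the client reaction proposed above."""
    if cause == "UNKNOWN_TOPIC_OR_PARTITION":
        # Case 1): the coordinator may have moved, so re-discovery makes sense.
        return RetryPlan(rediscover_coordinator=True,
                         backoff_ms=SHORT_BACKOFF_MS, warn=False)
    if cause in ("NOT_ENOUGH_REPLICAS", "NOT_ENOUGH_REPLICAS_AFTER_APPEND"):
        # Cases 2)/3): other brokers are unavailable, not the coordinator.
        return RetryPlan(rediscover_coordinator=False,
                         backoff_ms=LONG_BACKOFF_MS, warn=True)
    if cause == "REQUEST_TIMED_OUT":
        # Case 4): plain retry against the same coordinator.
        return RetryPlan(rediscover_coordinator=False,
                         backoff_ms=SHORT_BACKOFF_MS, warn=False)
    raise ValueError(f"cause not covered by this sketch: {cause}")
```

For example, `plan_retry("NOT_ENOUGH_REPLICAS")` yields a plan that keeps the
current coordinator, waits for the longer backoff, and flags a warning.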



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
