Guozhang Wang created KAFKA-10485:
-------------------------------------

             Summary: Use a separate error code for replication related errors
                 Key: KAFKA-10485
                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
             Project: Kafka
          Issue Type: Improvement
            Reporter: Guozhang Wang

Today, when a coordinator request involves an append to the internal topic, e.g. a commit / sync-group request sent to the group coordinator, we capture the following errors and translate them to COORDINATOR_NOT_AVAILABLE before returning to the client:

* UNKNOWN_TOPIC_OR_PARTITION
* NOT_ENOUGH_REPLICAS
* NOT_ENOUGH_REPLICAS_AFTER_APPEND
* REQUEST_TIMED_OUT (for txn coordinator)

Among those, the second and third cases are worth reconsidering, because COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator unnecessarily, with a short backoff time. The fourth case is probably also worth revisiting: although the motivation for using COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs an unnecessary coordinator re-discovery.

What would be better is that for 2)/3) clients do not re-discover the coordinator, but just retry with a longer backoff time, and at the same time expose this, either through a metric or through warning logs, to indicate that some other broker, not the coordinator, is unavailable and is causing the operation to be blocked. For 4) clients can just retry without re-discovery. Only for 1) does it make sense to let the clients re-discover the coordinator.
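
To make the proposed behaviour concrete, here is a minimal Java sketch of the per-error client reaction described above. It is not Kafka code: the class CoordinatorErrorHandling, the enums AppendError and ClientAction, the methods actionFor, backoffMs and shouldWarnAboutReplicas, and the backoff multiplier of 4 are all hypothetical names and values chosen for illustration. Only the four error names come from the list above.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: none of these types exist in the Kafka code base.
final class CoordinatorErrorHandling {

    /** The four append errors listed in the ticket, in the same order. */
    enum AppendError {
        UNKNOWN_TOPIC_OR_PARTITION,
        NOT_ENOUGH_REPLICAS,
        NOT_ENOUGH_REPLICAS_AFTER_APPEND,
        REQUEST_TIMED_OUT
    }

    /** Hypothetical client reactions used to express the proposal. */
    enum ClientAction {
        REDISCOVER_COORDINATOR,
        RETRY_WITH_LONGER_BACKOFF,
        RETRY
    }

    private static final Map<AppendError, ClientAction> PROPOSED =
            new EnumMap<>(AppendError.class);

    static {
        // 1) Only this case justifies re-discovering the coordinator.
        PROPOSED.put(AppendError.UNKNOWN_TOPIC_OR_PARTITION, ClientAction.REDISCOVER_COORDINATOR);
        // 2) and 3) The coordinator is fine; other replicas are the problem.
        PROPOSED.put(AppendError.NOT_ENOUGH_REPLICAS, ClientAction.RETRY_WITH_LONGER_BACKOFF);
        PROPOSED.put(AppendError.NOT_ENOUGH_REPLICAS_AFTER_APPEND, ClientAction.RETRY_WITH_LONGER_BACKOFF);
        // 4) Plain retry against the same coordinator, no re-discovery.
        PROPOSED.put(AppendError.REQUEST_TIMED_OUT, ClientAction.RETRY);
    }

    /** Returns the client reaction proposed in the ticket for the given error. */
    static ClientAction actionFor(AppendError error) {
        return PROPOSED.get(error);
    }

    /** Assumed policy: the "longer backoff" is 4x the base; the ticket gives no number. */
    static long backoffMs(ClientAction action, long baseBackoffMs) {
        switch (action) {
            case RETRY_WITH_LONGER_BACKOFF:
                return baseBackoffMs * 4;
            default:
                return baseBackoffMs;
        }
    }

    /** True when the client should surface a warning or metric about unavailable replicas. */
    static boolean shouldWarnAboutReplicas(AppendError error) {
        return actionFor(error) == ClientAction.RETRY_WITH_LONGER_BACKOFF;
    }

    public static void main(String[] args) {
        for (AppendError error : AppendError.values()) {
            System.out.println(error + " -> " + actionFor(error));
        }
    }
}
```

Keeping the mapping in one table makes the contrast with today's behaviour easy to see: currently all four errors collapse into the single re-discovery path, whereas the proposal reserves that path for the first case only.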