[ https://issues.apache.org/jira/browse/KAFKA-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717919#comment-16717919 ]
Kevin Li commented on KAFKA-7331:
---------------------------------
Feel free to close this.
> Kafka does not detect broker loss in the event of a network partition within
> the cluster
> ----------------------------------------------------------------------------------------
>
> Key: KAFKA-7331
> URL: https://issues.apache.org/jira/browse/KAFKA-7331
> Project: Kafka
> Issue Type: Bug
> Components: controller, network
> Affects Versions: 1.0.1
> Reporter: Kevin Li
> Priority: Major
>
> We ran into this issue on our production cluster and had to manually remove
> the broker and enable unclean leader elections to get the cluster working
> again. Ideally, Kafka itself could handle network partitions without manual
> intervention.
>
> The issue is reproducible with the following cross-datacenter Kafka cluster
> setup:
> DC 1: Kafka brokers + ZK nodes
> DC 2: Kafka brokers + ZK nodes
> DC 3: Kafka brokers + ZK nodes
>
> Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it
> cannot reach any hosts (brokers and ZK nodes) in the other 2 datacenters. The
> cluster goes into a state where the partitions that brokerA leads contain
> only brokerA in their ISRs. Since brokerA is still reachable by the ZK nodes
> in DC 1, it still shows up when querying ZK. The controller thinks brokerA is
> still up and does not elect new leaders for the partitions that brokerA
> leads. All of those partitions stay unavailable until brokerA is back
> or completely removed from the cluster (in which case unclean leader election
> can elect new leaders for those partitions).
>
> A faster recovery path could be for a majority of hosts (ZK nodes?) to
> realize that brokerA is unreachable and mark it as down, so that elections for
> the partitions it leads can be triggered. This avoids waiting
> indefinitely for the broker to come back or taking manual action to remove the
> broker from the cluster.
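
The majority-based detection suggested in the description could look roughly like the sketch below. It is only an illustration of the idea, not how Kafka or ZooKeeper behaves today; the function name, the observer names, and the shape of the reachability reports are all hypothetical.

```python
from typing import Dict


def is_broker_down(reports: Dict[str, bool]) -> bool:
    """Decide whether a broker should be marked down by majority vote.

    `reports` maps an observer name (hypothetical, e.g. one ZK node per
    datacenter) to True if that observer can still reach the broker.
    The broker is considered down only when a strict majority of the
    observers report it as unreachable.
    """
    if not reports:
        # No observations at all: do not mark the broker down.
        return False
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable * 2 > len(reports)


# Scenario from the description: brokerA is reachable only from DC 1.
partitioned = {"zk-dc1": True, "zk-dc2": False, "zk-dc3": False}

# A single observer losing contact is not enough to mark the broker down.
one_bad_link = {"zk-dc1": True, "zk-dc2": True, "zk-dc3": False}

print(is_broker_down(partitioned))
print(is_broker_down(one_bad_link))
```

With three observers, the broker is marked down only when at least two of them cannot reach it, which matches the partition scenario above while tolerating a single bad link.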