[ 
https://issues.apache.org/jira/browse/KAFKA-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717919#comment-16717919
 ] 

Kevin Li commented on KAFKA-7331:
---------------------------------

Feel free to close this.

> Kafka does not detect broker loss in the event of a network partition within 
> the cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7331
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7331
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, network
>    Affects Versions: 1.0.1
>            Reporter: Kevin Li
>            Priority: Major
>
> We ran into this issue on our production cluster and had to manually remove 
> the broker and enable unclean leader elections to get the cluster working 
> again. Ideally, Kafka itself could handle network partitions without manual 
> intervention.
> The issue is reproducible with the following cross-datacenter Kafka cluster 
> setup:
> DC 1: Kafka brokers + ZK nodes
> DC 2: Kafka brokers + ZK nodes
> DC 3: Kafka brokers + ZK nodes
> Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it 
> cannot reach any hosts (brokers or ZK nodes) in the other two datacenters. 
> The cluster goes into a state where every partition led by brokerA has only 
> brokerA in its ISR. Since brokerA is still reachable by the ZK nodes in DC 1, 
> it still shows up when querying ZK. The controller therefore thinks brokerA 
> is still up and does not elect new leaders for the partitions that brokerA 
> leads. All of those partitions stay down until brokerA is back or is 
> completely removed from the cluster (in which case unclean leader election 
> can elect new leaders for them).
> A faster recovery path could be for a majority of hosts (ZK nodes?) to 
> realize that brokerA is unreachable and mark it as down, so that elections 
> for the partitions it leads can be triggered. This avoids waiting 
> indefinitely for the broker to come back or taking manual action to remove 
> the broker from the cluster.
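
The majority rule proposed in the description can be sketched roughly as
below. This is only an illustration of the idea, not how Kafka or ZooKeeper
behaves: the function name is_broker_down, the observer names (zk-dc1,
zk-dc2, zk-dc3), and the notion of per-observer reachability reports are all
assumptions made for the example.

```python
def is_broker_down(reports):
    """Decide whether a broker should be marked down.

    `reports` is a hypothetical mapping of observer name -> True if that
    observer can currently reach the broker, False otherwise.
    The broker is treated as down only when a strict majority of the
    observers report it as unreachable.
    """
    if not reports:
        # With no observations, do not mark the broker down.
        return False
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable > len(reports) // 2


# Scenario from the description: brokerA is reachable only from DC 1.
reports = {"zk-dc1": True, "zk-dc2": False, "zk-dc3": False}
print(is_broker_down(reports))  # prints True (2 of 3 observers cannot reach it)
```

Under this rule, a single observer that can still reach brokerA (the DC 1
case above) would no longer be enough to keep it marked as alive.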



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
