[ https://issues.apache.org/jira/browse/KAFKA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311109#comment-14311109 ]
Gwen Shapira commented on KAFKA-1908:
-------------------------------------

I think it's a multi-LAN scenario. The broker can bind on all available interfaces (0.0.0.0). If the port is blocked on the inter-broker interface but not on the network between clients and brokers, the scenario described seems possible (although I didn't try to reproduce it myself).

In other clusters, this scenario is prevented by having brokers periodically check access to each other (heartbeat) and validate the result against ZK. If a node is visible in ZK but not accessible over the network, the minority partition is killed (STONITH, or using ZK to cause a node to commit suicide) and the majority triggers leader election.

This is not a simple mechanism to add to Kafka, and I'm not sure this is a common enough issue to warrant the complexity involved.

> Split brain
> -----------
>
> Key: KAFKA-1908
> URL: https://issues.apache.org/jira/browse/KAFKA-1908
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.8.2
> Reporter: Alexey Ozeritskiy
>
> In some cases, there may be two leaders for one partition.
> Steps to reproduce:
> # We have 3 brokers, 1 partition with 3 replicas:
> {code}
> TopicAndPartition: [partition,0] Leader: 1 Replicas: [2,1,3]
> ISR: [1,2,3]
> {code}
> # controller works on broker 3
> # let the kafka port be 9092. We execute on broker 1:
> {code}
> iptables -A INPUT -p tcp --dport 9092 -j REJECT
> {code}
> # Initiate replica election
> # As a result:
> Broker 1:
> {code}
> TopicAndPartition: [partition,0] Leader: 1 Replicas: [2,1,3]
> ISR: [1,2,3]
> {code}
> Broker 2:
> {code}
> TopicAndPartition: [partition,0] Leader: 2 Replicas: [2,1,3]
> ISR: [1,2,3]
> {code}
> # Flush the iptables rules on broker 1
> Now we can produce messages to {code}[partition,0]{code}. Replica-1 will not
> receive new data. A consumer can read data from replica-1 or replica-2. When
> it reads from replica-1 it resets the offsets and then can read duplicates
> from replica-2.
> We saw this situation in our production cluster when it had network problems.
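
The end state in the reproduction steps (broker 1 reporting leader 1 and broker 2 reporting leader 2 for the same partition) is something one could check for mechanically. The following is only a minimal sketch in Python, not anything that exists in Kafka: the function name `find_split_brain` and the `views` input shape are hypothetical, and it assumes the per-broker leader views have already been collected by some other means.

```python
from collections import defaultdict


def find_split_brain(views):
    """Return partitions for which brokers disagree about the leader.

    `views` is a hypothetical input: broker id -> {partition: leader id},
    i.e. each broker's own idea of who leads each partition.
    """
    leaders = defaultdict(set)
    for broker_id, view in views.items():
        for partition, leader in view.items():
            leaders[partition].add(leader)
    # More than one distinct leader for a partition means split brain.
    return {p: sorted(ids) for p, ids in leaders.items() if len(ids) > 1}


# State taken from the "As a result" step of the report above.
views = {
    1: {("partition", 0): 1},  # broker 1 still believes it is the leader
    2: {("partition", 0): 2},  # broker 2 was elected leader
}
print(find_split_brain(views))  # {('partition', 0): [1, 2]}
```

A check like this only detects the divergence after the fact; it does not fence the stale leader, which is the harder part discussed in the comment.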