[ 
https://issues.apache.org/jira/browse/KAFKA-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin McCabe updated KAFKA-15489:
---------------------------------
    Labels:   (was: 4.0-blocker)

> split brain in KRaft cluster 
> -----------------------------
>
>                 Key: KAFKA-15489
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15489
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.5.1
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> I found that in the current KRaft implementation, when a network partition 
> happens between the current controller leader and the other controller nodes, 
> a "split brain" issue can occur. It causes 2 leaders to exist in the 
> controller cluster, and 2 inconsistent sets of metadata to be returned to 
> clients.
>  
> *Root cause*
> In 
> [KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
>  we said a voter will begin a new election under three conditions:
> 1. If it fails to receive a FetchResponse from the current leader before 
> expiration of quorum.fetch.timeout.ms
> 2. If it receives an EndQuorumEpoch request from the current leader
> 3. If it fails to receive a majority of votes before expiration of 
> quorum.election.timeout.ms after declaring itself a candidate.
> And that is exactly what the current KRaft implementation does.
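>  
> (A minimal, hedged sketch of condition 1 only; the class and method names 
> below are illustrative and are not the actual org.apache.kafka.raft code.)
> {code:java}
> // Illustrative only: a follower that has not received a FetchResponse from
> // the leader within quorum.fetch.timeout.ms becomes a candidate in a bumped
> // epoch and would then send VoteRequests to the other voters.
> public class FollowerElectionTimer {
>     private final long fetchTimeoutMs;   // quorum.fetch.timeout.ms
>     private long lastFetchResponseMs;    // last FetchResponse from the leader
>     private int epoch;
>     private boolean candidate = false;
> 
>     public FollowerElectionTimer(long fetchTimeoutMs, long nowMs, int epoch) {
>         this.fetchTimeoutMs = fetchTimeoutMs;
>         this.lastFetchResponseMs = nowMs;
>         this.epoch = epoch;
>     }
> 
>     // Called whenever a FetchResponse from the current leader is handled.
>     public void onFetchResponse(long nowMs) {
>         lastFetchResponseMs = nowMs;
>     }
> 
>     // Condition 1: no FetchResponse within quorum.fetch.timeout.ms, so the
>     // follower declares itself a candidate in a new epoch.
>     public void maybeStartElection(long nowMs) {
>         if (!candidate && nowMs - lastFetchResponseMs >= fetchTimeoutMs) {
>             candidate = true;
>             epoch += 1;
>             // sendVoteRequestsToVoters(epoch);  // omitted in this sketch
>         }
>     }
> }
> {code}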
>  
> However, when the leader is isolated by the network partition, there is no 
> way for it to resign from the leadership and start a new election. So the 
> leader will remain the leader even if all other nodes are down or 
> unreachable, and this makes the split brain issue possible.
> Reading further in KIP-595, I found we indeed considered this situation and 
> have a solution for it. In [this 
> section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
>  it says:
> {quote}In the pull-based model, however, say a new leader has been elected 
> with a new epoch and everyone has learned about it except the old leader 
> (e.g. that leader was not in the voters anymore and hence not receiving the 
> BeginQuorumEpoch as well), then that old leader would not be notified by 
> anyone about the new leader / epoch and become a pure "zombie leader", as 
> there is no regular heartbeats being pushed from leader to the follower. This 
> could lead to stale information being served to the observers and clients 
> inside the cluster.
> {quote}
> {quote}To resolve this issue, we will piggy-back on the 
> "quorum.fetch.timeout.ms" config, such that if the leader did not receive 
> Fetch requests from a majority of the quorum for that amount of time, it 
> would begin a new election and start sending VoteRequest to voter nodes in 
> the cluster to understand the latest quorum. If it couldn't connect to any 
> known voter, the old leader shall keep starting new elections and bump the 
> epoch.
> {quote}
>  
> But this behavior is missing from the current KRaft implementation; a sketch 
> of the described leader-side check follows.
>  
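> (A minimal, hedged sketch of the leader-side check that KIP-595 describes; 
> names are illustrative and this is not the actual KRaft implementation.)
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Illustrative only: the leader remembers the last Fetch request it saw from
> // each other voter; if it has not heard from a majority of the quorum within
> // quorum.fetch.timeout.ms, it should resign and start a new election instead
> // of continuing to serve stale metadata as a "zombie leader".
> public class LeaderFetchTimeoutCheck {
>     private final long fetchTimeoutMs;   // quorum.fetch.timeout.ms
>     private final int voterCount;        // size of the controller quorum
>     private final Map<Integer, Long> lastFetchByVoter = new HashMap<>();
> 
>     public LeaderFetchTimeoutCheck(long fetchTimeoutMs, int voterCount) {
>         this.fetchTimeoutMs = fetchTimeoutMs;
>         this.voterCount = voterCount;
>     }
> 
>     // Record each Fetch request the leader receives from another voter.
>     public void onFetchFromVoter(int voterId, long nowMs) {
>         lastFetchByVoter.put(voterId, nowMs);
>     }
> 
>     // The leader counts itself toward the majority.
>     public boolean hasMajorityContact(long nowMs) {
>         long recentVoters = lastFetchByVoter.values().stream()
>                 .filter(ts -> nowMs - ts < fetchTimeoutMs)
>                 .count();
>         return recentVoters + 1 > voterCount / 2;
>     }
> 
>     // If contact with a majority is lost, the leader should resign and
>     // start a new election (bumping the epoch), per KIP-595.
>     public boolean shouldResign(long nowMs) {
>         return !hasMajorityContact(nowMs);
>     }
> }
> {code}
>  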
> *The flow is like this:*
> 1. There are 3 controller nodes: A (leader), B (follower), C (follower).
> 2. A network partition happens between [A] and [B, C].
> 3. B and C start a new election since the fetch timeout expires before 
> receiving a fetch response from leader A.
> 4. B (or C) is elected as leader in a new epoch, while A is still the leader 
> in the old epoch.
> 5. Broker D creates a topic "new", and the update goes to leader B.
> 6. Broker E describes topic "new", but gets nothing because it is connected 
> to the old leader A (see the client-side sketch after this list).
>  
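> (A rough, hedged illustration of steps 5 and 6 using the public Admin API; 
> the broker addresses broker-d:9092 and broker-e:9092 are hypothetical. Under 
> the split brain, the same lookup can succeed through one broker and fail or 
> return stale metadata through another, depending on which controller that 
> broker follows.)
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.NewTopic;
> 
> public class SplitBrainRepro {
>     static Admin adminFor(String bootstrap) {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
>         return Admin.create(props);
>     }
> 
>     public static void main(String[] args) throws Exception {
>         // Step 5: create topic "new" via broker D, which follows new leader B.
>         try (Admin adminD = adminFor("broker-d:9092")) {
>             adminD.createTopics(
>                     Collections.singleton(new NewTopic("new", 1, (short) 1)))
>                   .all().get();
>         }
> 
>         // Step 6: describe topic "new" via broker E, which still follows the
>         // old leader A. Under the split brain this can fail with an unknown
>         // topic error or return stale metadata.
>         try (Admin adminE = adminFor("broker-e:9092")) {
>             adminE.describeTopics(Collections.singleton("new"))
>                   .allTopicNames().get()
>                   .forEach((name, desc) -> System.out.println(name + " -> " + desc));
>         }
>     }
> }
> {code}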



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
