[
https://issues.apache.org/jira/browse/KAFKA-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
José Armando García Sancio updated KAFKA-15489:
-----------------------------------------------
Description:
I found that in the current KRaft implementation, when a network partition occurs
between the current controller leader and the other controller nodes, brokers
may see stale metadata and be unable to update the metadata through the
controller. As a result, two leaders exist in the controller cluster at two
different epochs, and two inconsistent views of the metadata are returned to
the clients.
*Root cause*
In
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
we said that a voter will begin a new election under three conditions:
1. If it fails to receive a FetchResponse from the current leader before
expiration of quorum.fetch.timeout.ms
2. If it receives an EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of
quorum.election.timeout.ms after declaring itself a candidate.
And that is exactly what the current KRaft implementation does (sketched below).
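For illustration, here is a minimal sketch of those three voter-side triggers. This is not Kafka's actual code (the class and names are made up for this issue), and the timeout values are placeholders for quorum.fetch.timeout.ms and quorum.election.timeout.ms:
{code:java}
// Illustrative only: not KafkaRaftClient, just the three election triggers above.
public class VoterElectionTriggers {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;
    private long lastFetchResponseMs;
    private long candidateSinceMs;

    // Placeholders for quorum.fetch.timeout.ms and quorum.election.timeout.ms.
    private final long fetchTimeoutMs = 2_000;
    private final long electionTimeoutMs = 1_000;

    // Condition 1: no FetchResponse from the leader before the fetch timeout expires.
    // Condition 3: as a candidate, no majority of votes before the election timeout expires.
    void onPoll(long nowMs) {
        if (role == Role.FOLLOWER && nowMs - lastFetchResponseMs >= fetchTimeoutMs) {
            becomeCandidate(nowMs);
        } else if (role == Role.CANDIDATE && nowMs - candidateSinceMs >= electionTimeoutMs) {
            becomeCandidate(nowMs); // bump the epoch and try again
        }
    }

    // Condition 2: the current leader sent us an EndQuorumEpoch request.
    void onEndQuorumEpoch(long nowMs) {
        becomeCandidate(nowMs);
    }

    void onFetchResponse(long nowMs) {
        lastFetchResponseMs = nowMs;
    }

    private void becomeCandidate(long nowMs) {
        role = Role.CANDIDATE;
        candidateSinceMs = nowMs;
        // ...increment the epoch and send VoteRequest to the other voters...
    }
}
{code}
Note that none of these triggers ever fires on a node that still believes it is the leader, which is exactly the gap described next.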
However, when the leader is isolated by a network partition, there is no way
for it to resign from the leadership and start a new election, so it stays the
leader even though all the other nodes are unreachable. This makes it possible
for the stale leader to keep reporting stale metadata and for RPCs sent to it
to time out.
When reading further in KIP-595, I found that we did consider this situation
and have a solution for it. In [this
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
it says:
{quote}In the pull-based model, however, say a new leader has been elected with
a new epoch and everyone has learned about it except the old leader (e.g. that
leader was not in the voters anymore and hence not receiving the
BeginQuorumEpoch as well), then that old leader would not be notified by anyone
about the new leader / epoch and become a pure "zombie leader", as there is no
regular heartbeats being pushed from leader to the follower. This could lead to
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch
requests from a majority of the quorum for that amount of time, it would begin
a new election and start sending VoteRequest to voter nodes in the cluster to
understand the latest quorum. If it couldn't connect to any known voter, the
old leader shall keep starting new elections and bump the epoch.
{quote}
But this behavior is missing from the current KRaft implementation.
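To make the quoted fix concrete, here is a minimal sketch of the leader-side check it describes, again with made-up names rather than the actual KafkaRaftClient API: the leader tracks the last Fetch request it received from each voter, and if it has not heard from a majority within the fetch timeout it should resign and start a new election.
{code:java}
// Illustrative only: the leader-side fetch-timeout check described in KIP-595.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LeaderFetchTimeoutCheck {

    private final Set<Integer> voters;            // all voter ids, including the local leader
    private final int localId;
    private final long fetchTimeoutMs;            // quorum.fetch.timeout.ms
    private final Map<Integer, Long> lastFetchMs = new HashMap<>();

    public LeaderFetchTimeoutCheck(Set<Integer> voters, int localId, long fetchTimeoutMs, long nowMs) {
        this.voters = voters;
        this.localId = localId;
        this.fetchTimeoutMs = fetchTimeoutMs;
        voters.forEach(id -> lastFetchMs.put(id, nowMs));
    }

    // Called whenever a voter's Fetch request reaches the leader.
    public void onFetchFromVoter(int voterId, long nowMs) {
        lastFetchMs.put(voterId, nowMs);
    }

    // If this returns false on the leader's poll loop, the leader should resign
    // and start a new election instead of staying leader forever.
    public boolean hasMajorityContact(long nowMs) {
        long inContact = voters.stream()
            .filter(id -> id == localId || nowMs - lastFetchMs.get(id) < fetchTimeoutMs)
            .count();
        return inContact > voters.size() / 2;
    }
}
{code}
In the scenario below, this check would fail on the isolated leader A once B and C stop fetching from it, so A would step down instead of continuing to serve stale metadata.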
*The flow is like this* (a way to observe the divergence is sketched after the steps):
1. 3 controller nodes: A (leader), B (follower), C (follower)
2. A network partition occurs between [A] and [B, C].
3. B and C start a new election because the fetch timeout expires before they
receive a fetch response from leader A.
4. B (or C) is elected leader in a new epoch, while A is still the leader in
the old epoch.
5. Broker D creates a topic "new", and the update goes to the new leader B.
6. Broker E describes topic "new" but gets nothing, because it is connected to
the old leader A.
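For steps 5-6, one way to see the two views diverge (purely illustrative; the bootstrap addresses are placeholders) is to ask two brokers on opposite sides of the partition for their view of the metadata quorum through the Admin client; with this bug they can report different leaders at different epochs:
{code:java}
// Illustrative check: compare the quorum view reported via two brokers that are
// connected to different controllers during the partition.
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class CompareQuorumViews {

    static QuorumInfo quorumInfoVia(String bootstrap) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(props)) {
            return admin.describeMetadataQuorum().quorumInfo().get();
        }
    }

    public static void main(String[] args) throws Exception {
        // Broker D (step 5) follows the new leader B; broker E (step 6) still follows old leader A.
        QuorumInfo viaD = quorumInfoVia("broker-d:9092");
        QuorumInfo viaE = quorumInfoVia("broker-e:9092");

        System.out.println("via D: leaderId=" + viaD.leaderId() + " epoch=" + viaD.leaderEpoch());
        System.out.println("via E: leaderId=" + viaE.leaderId() + " epoch=" + viaE.leaderEpoch());
    }
}
{code}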
was:
I found in the current KRaft implementation, when network partition happened
between the current controller leader and the other controller nodes, the
"split brain" issue will happen. It causes 2 leaders will exist in the
controller cluster, and 2 inconsistent sets of metadata will return to the
clients.
*Root cause*
In
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
we said A voter will begin a new election under three conditions:
1. If it fails to receive a FetchResponse from the current leader before
expiration of quorum.fetch.timeout.ms
2. If it receives a EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of
quorum.election.timeout.ms after declaring itself a candidate.
And that's exactly what the current KRaft's implementation.
However, when the leader is isolated from the network partition, there's no way
for it to resign from the leadership and start a new election. So the leader
will always be the leader even though all other nodes are down. And this makes
the split brain issue possible.
When reading further in the KIP-595, I found we indeed considered this
situation and have solution for that. in [this
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
it said:
{quote}In the pull-based model, however, say a new leader has been elected with
a new epoch and everyone has learned about it except the old leader (e.g. that
leader was not in the voters anymore and hence not receiving the
BeginQuorumEpoch as well), then that old leader would not be notified by anyone
about the new leader / epoch and become a pure "zombie leader", as there is no
regular heartbeats being pushed from leader to the follower. This could lead to
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch
requests from a majority of the quorum for that amount of time, it would begin
a new election and start sending VoteRequest to voter nodes in the cluster to
understand the latest quorum. If it couldn't connect to any known voter, the
old leader shall keep starting new elections and bump the epoch.
{quote}
But we missed this implementation in current KRaft.
*The flow is like this:*
1. 3 controller nodes, A(leader), B(follower), C(follower)
2. network partition happened between [A] and [B, C].
3. B and C starts new election since fetch timeout expired before receiving
fetch response from leader A.
4. B (or C) is elected as a leader in new epoch, while A is still the leader in
old epoch.
5. broker D creates a topic "new", and updates to leader B.
6. broker E describe topic "new", but got nothing because it is connecting to
the old leader A.
> Stale KRaft metadata reported during network partition
> ------------------------------------------------------
>
> Key: KAFKA-15489
> URL: https://issues.apache.org/jira/browse/KAFKA-15489
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.5.1
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
> Fix For: 3.7.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)