[
https://issues.apache.org/jira/browse/KAFKA-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
José Armando García Sancio updated KAFKA-15489:
-----------------------------------------------
Description:
I found that in the current KRaft implementation, when a network partition occurs
between the current controller leader and the other controller nodes, brokers
may see stale metadata and be unable to update the metadata through the
controller. As a result, two leaders exist in the controller cluster at two
different epochs, and two inconsistent views of the metadata are returned to
the clients.
*Root cause*
In
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
we said that a voter will begin a new election under three conditions:
1. If it fails to receive a FetchResponse from the current leader before
expiration of quorum.fetch.timeout.ms
2. If it receives an EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of
quorum.election.timeout.ms after declaring itself a candidate.
And that is exactly what the current KRaft implementation does (sketched below).
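For illustration, here is a minimal sketch of those three voter-side triggers. This is not Kafka's actual code (the class and names are made up for this issue), and the timeout values are placeholders for quorum.fetch.timeout.ms and quorum.election.timeout.ms:
{code:java}
// Illustrative only: not KafkaRaftClient, just the three election triggers above.
public class VoterElectionTriggers {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;
    private long lastFetchResponseMs;
    private long candidateSinceMs;

    // Placeholders for quorum.fetch.timeout.ms and quorum.election.timeout.ms.
    private final long fetchTimeoutMs = 2_000;
    private final long electionTimeoutMs = 1_000;

    // Condition 1: no FetchResponse from the leader before the fetch timeout expires.
    // Condition 3: as a candidate, no majority of votes before the election timeout expires.
    void onPoll(long nowMs) {
        if (role == Role.FOLLOWER && nowMs - lastFetchResponseMs >= fetchTimeoutMs) {
            becomeCandidate(nowMs);
        } else if (role == Role.CANDIDATE && nowMs - candidateSinceMs >= electionTimeoutMs) {
            becomeCandidate(nowMs); // bump the epoch and try again
        }
    }

    // Condition 2: the current leader sent us an EndQuorumEpoch request.
    void onEndQuorumEpoch(long nowMs) {
        becomeCandidate(nowMs);
    }

    void onFetchResponse(long nowMs) {
        lastFetchResponseMs = nowMs;
    }

    private void becomeCandidate(long nowMs) {
        role = Role.CANDIDATE;
        candidateSinceMs = nowMs;
        // ...increment the epoch and send VoteRequest to the other voters...
    }
}
{code}
Note that none of these triggers ever fires on a node that still believes it is the leader, which is exactly the gap described next.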
However, when the leader is isolated by a network partition, there is no way
for it to resign from the leadership and start a new election, so it stays the
leader even though all the other nodes are unreachable. This makes it possible
for the stale leader to keep reporting stale metadata and for RPCs sent to it
to time out.
When reading further in KIP-595, I found that we did consider this situation
and have a solution for it. In [this
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
it says:
{quote}In the pull-based model, however, say a new leader has been elected with
a new epoch and everyone has learned about it except the old leader (e.g. that
leader was not in the voters anymore and hence not receiving the
BeginQuorumEpoch as well), then that old leader would not be notified by anyone
about the new leader / epoch and become a pure "zombie leader", as there is no
regular heartbeats being pushed from leader to the follower. This could lead to
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch
requests from a majority of the quorum for that amount of time, it would begin
a new election and start sending VoteRequest to voter nodes in the cluster to
understand the latest quorum. If it couldn't connect to any known voter, the
old leader shall keep starting new elections and bump the epoch.
{quote}
But this behavior is missing from the current KRaft implementation.
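To make the quoted fix concrete, here is a minimal sketch of the leader-side check it describes, again with made-up names rather than the actual KafkaRaftClient API: the leader tracks the last Fetch request it received from each voter, and if it has not heard from a majority within the fetch timeout it should resign and start a new election.
{code:java}
// Illustrative only: the leader-side fetch-timeout check described in KIP-595.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LeaderFetchTimeoutCheck {

    private final Set<Integer> voters;            // all voter ids, including the local leader
    private final int localId;
    private final long fetchTimeoutMs;            // quorum.fetch.timeout.ms
    private final Map<Integer, Long> lastFetchMs = new HashMap<>();

    public LeaderFetchTimeoutCheck(Set<Integer> voters, int localId, long fetchTimeoutMs, long nowMs) {
        this.voters = voters;
        this.localId = localId;
        this.fetchTimeoutMs = fetchTimeoutMs;
        voters.forEach(id -> lastFetchMs.put(id, nowMs));
    }

    // Called whenever a voter's Fetch request reaches the leader.
    public void onFetchFromVoter(int voterId, long nowMs) {
        lastFetchMs.put(voterId, nowMs);
    }

    // If this returns false on the leader's poll loop, the leader should resign
    // and start a new election instead of staying leader forever.
    public boolean hasMajorityContact(long nowMs) {
        long inContact = voters.stream()
            .filter(id -> id == localId || nowMs - lastFetchMs.get(id) < fetchTimeoutMs)
            .count();
        return inContact > voters.size() / 2;
    }
}
{code}
In the scenario below, this check would fail on the isolated leader A once B and C stop fetching from it, so A would step down instead of continuing to serve stale metadata.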
*The flow is like this* (a way to observe the divergence is sketched after the steps):
1. 3 controller nodes: A (leader), B (follower), C (follower)
2. A network partition occurs between [A] and [B, C].
3. B and C start a new election because the fetch timeout expires before they
receive a fetch response from leader A.
4. B (or C) is elected leader in a new epoch, while A is still the leader in
the old epoch.
5. Broker D creates a topic "new", and the update goes to the new leader B.
6. Broker E describes topic "new" but gets nothing, because it is connected to
the old leader A.
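For steps 5-6, one way to see the two views diverge (purely illustrative; the bootstrap addresses are placeholders) is to ask two brokers on opposite sides of the partition for their view of the metadata quorum through the Admin client; with this bug they can report different leaders at different epochs:
{code:java}
// Illustrative check: compare the quorum view reported via two brokers that are
// connected to different controllers during the partition.
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class CompareQuorumViews {

    static QuorumInfo quorumInfoVia(String bootstrap) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(props)) {
            return admin.describeMetadataQuorum().quorumInfo().get();
        }
    }

    public static void main(String[] args) throws Exception {
        // Broker D (step 5) follows the new leader B; broker E (step 6) still follows old leader A.
        QuorumInfo viaD = quorumInfoVia("broker-d:9092");
        QuorumInfo viaE = quorumInfoVia("broker-e:9092");

        System.out.println("via D: leaderId=" + viaD.leaderId() + " epoch=" + viaD.leaderEpoch());
        System.out.println("via E: leaderId=" + viaE.leaderId() + " epoch=" + viaE.leaderEpoch());
    }
}
{code}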
was:
I found in the current KRaft implementation, when network partition happened
between the current controller leader and the other controller nodes, the
"split brain" issue will happen. It causes 2 leaders will exist in the
controller cluster, and 2 inconsistent sets of metadata will return to the
clients.
*Root cause*
In
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
we said A voter will begin a new election under three conditions:
1. If it fails to receive a FetchResponse from the current leader before
expiration of quorum.fetch.timeout.ms
2. If it receives a EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of
quorum.election.timeout.ms after declaring itself a candidate.
And that's exactly what the current KRaft's implementation.
However, when the leader is isolated from the network partition, there's no way
for it to resign from the leadership and start a new election. So the leader
will always be the leader even though all other nodes are down. And this makes
the split brain issue possible.
When reading further in the KIP-595, I found we indeed considered this
situation and have solution for that. in [this
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
it said:
{quote}In the pull-based model, however, say a new leader has been elected with
a new epoch and everyone has learned about it except the old leader (e.g. that
leader was not in the voters anymore and hence not receiving the
BeginQuorumEpoch as well), then that old leader would not be notified by anyone
about the new leader / epoch and become a pure "zombie leader", as there is no
regular heartbeats being pushed from leader to the follower. This could lead to
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch
requests from a majority of the quorum for that amount of time, it would begin
a new election and start sending VoteRequest to voter nodes in the cluster to
understand the latest quorum. If it couldn't connect to any known voter, the
old leader shall keep starting new elections and bump the epoch.
{quote}
But we missed this implementation in current KRaft.
*The flow is like this:*
1. 3 controller nodes, A(leader), B(follower), C(follower)
2. network partition happened between [A] and [B, C].
3. B and C starts new election since fetch timeout expired before receiving
fetch response from leader A.
4. B (or C) is elected as a leader in new epoch, while A is still the leader in
old epoch.
5. broker D creates a topic "new", and updates to leader B.
6. broker E describe topic "new", but got nothing because it is connecting to
the old leader A.
> Stale KRaft metadata reported during network partition
> ------------------------------------------------------
>
> Key: KAFKA-15489
> URL: https://issues.apache.org/jira/browse/KAFKA-15489
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.5.1
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
> Fix For: 3.7.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)