[ https://issues.apache.org/jira/browse/KAFKA-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jack Vanlightly updated KAFKA-16281:
------------------------------------
    Summary: Possible IllegalState with KIP-996  (was: Probable IllegalState possible with KIP-966)

> Possible IllegalState with KIP-996
> ----------------------------------
>
>                 Key: KAFKA-16281
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16281
>             Project: Kafka
>          Issue Type: Task
>          Components: kraft
>            Reporter: Jack Vanlightly
>            Priority: Major
>
> I have a TLA+ model of KIP-996 and I have identified an IllegalState
> exception that would occur with the existing MaybeHandleCommonResponse
> behavior.
> The issue stems from the fact that a leader, let's call it r1, can resign
> (either due to a restart or check quorum) and then later initiate a pre-vote
> where it ends up in the same epoch as before, but with a cleared local
> leader id. When r1 transitions to Prospective it clears its local leader id.
> When r1 receives a response from r2, which believes that r1 is still the
> leader, the logic in MaybeHandleCommonResponse tries to transition r1 to
> follower of itself, causing an IllegalState exception to be raised.
> This is an example history:
> # r1 is the leader in epoch 1.
> # r1 resigns due to check quorum, or restarts and resigns.
> # r1 experiences an election timeout and transitions to Prospective,
> clearing its local leader id.
> # r1 sends a pre-vote request to its peers.
> # r2, which still thinks r1 is the leader, sends a vote response that does
> not grant its vote and sets leaderId=r1 and epoch=1.
> # r1 receives the vote response and executes MaybeHandleCommonResponse,
> which tries to transition r1 to Follower of itself, and an IllegalState
> exception is raised.
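The example history can be replayed in a small toy model. The sketch below is not KafkaRaftClient code: every class, method, and field name in it is hypothetical, and the three-part check in `handleCommonResponse` is only an assumption about the general shape of the existing condition (same epoch, a leader id in the response, no locally known leader). The `skipSelfAsLeader` flag stands in for the extra "leaderId is not this server's id" condition discussed in this issue.

```java
import java.util.OptionalInt;

// Toy model of the history in this issue. All names here are hypothetical;
// this is a simplified illustration, not the Kafka implementation.
final class PreVoteSelfFollowSketch {
    enum Role { LEADER, RESIGNED, PROSPECTIVE, FOLLOWER }

    static final class Replica {
        final int localId;
        int epoch;
        Role role = Role.LEADER;
        OptionalInt leaderId;

        Replica(int localId, int epoch) {
            this.localId = localId;
            this.epoch = epoch;
            this.leaderId = OptionalInt.of(localId);
        }

        void resign() {
            role = Role.RESIGNED;
        }

        // Election timeout: stay in the same epoch but forget the leader.
        void transitionToProspective() {
            role = Role.PROSPECTIVE;
            leaderId = OptionalInt.empty();
        }

        void transitionToFollower(int newEpoch, int newLeaderId) {
            if (newLeaderId == localId) {
                throw new IllegalStateException(
                    "Replica " + localId + " cannot become a follower of itself");
            }
            epoch = newEpoch;
            role = Role.FOLLOWER;
            leaderId = OptionalInt.of(newLeaderId);
        }

        // Assumed shape of the common response check, plus the optional guard.
        void handleCommonResponse(int responseEpoch, OptionalInt responseLeaderId,
                                  boolean skipSelfAsLeader) {
            boolean learnedLeader = responseEpoch == epoch
                && responseLeaderId.isPresent()
                && !leaderId.isPresent();
            boolean leaderIsSelf = responseLeaderId.isPresent()
                && responseLeaderId.getAsInt() == localId;
            if (learnedLeader && !(skipSelfAsLeader && leaderIsSelf)) {
                transitionToFollower(responseEpoch, responseLeaderId.getAsInt());
            }
        }
    }

    // Replays steps 1 to 6 and returns r1's final role.
    static Role replayHistory(boolean skipSelfAsLeader) {
        Replica r1 = new Replica(1, 1);   // step 1: r1 leads epoch 1
        r1.resign();                      // step 2: r1 resigns
        r1.transitionToProspective();     // step 3: leader id cleared
        // Steps 4 and 5: r2 rejects the pre-vote with leaderId=r1, epoch=1.
        r1.handleCommonResponse(1, OptionalInt.of(1), skipSelfAsLeader); // step 6
        return r1.role;
    }

    public static void main(String[] args) {
        try {
            replayHistory(false);
        } catch (IllegalStateException e) {
            System.out.println("without guard: " + e.getMessage());
        }
        System.out.println("with guard: " + replayHistory(true));
    }
}
```

In this toy model, `replayHistory(false)` throws at step 6, while `replayHistory(true)` leaves r1 in the Prospective role. Whether staying Prospective is the right outcome in the real client is a separate design question that this sketch does not settle.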
> The relevant else if statement in MaybeHandleCommonResponse is here:
> https://github.com/apache/kafka/blob/a26a1d847f1884a519561e7a4fb4cd13e051c824/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1538
> In the TLA+ specification, I fixed this issue by adding a fourth condition to
> this statement, that the leaderId also does not equal this server's id.
> [https://github.com/Vanlightly/kafka-tlaplus/blob/9b2600d1cd5c65930d666b12792d47362b64c015/kraft/kip_996/kraft_kip_996_functions.tla#L336]
> We should probably create a test to confirm the issue first and then look at
> using the fix I made in the TLA+, though there may be other options.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)