[ 
https://issues.apache.org/jira/browse/KAFKA-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Vanlightly updated KAFKA-16281:
------------------------------------
    Summary: Possible IllegalState with KIP-996  (was: Probable IllegalState 
possible with KIP-966)

> Possible IllegalState with KIP-996
> ----------------------------------
>
>                 Key: KAFKA-16281
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16281
>             Project: Kafka
>          Issue Type: Task
>          Components: kraft
>            Reporter: Jack Vanlightly
>            Priority: Major
>
> I have a TLA+ model of KIP-996 and I have identified an IllegalState 
> exception that would occur with the existing MaybeHandleCommonResponse 
> behavior.
> The issue stems from the fact that a leader, let's call it r1, can resign 
> (either due to a restart or a failed check quorum) and then later initiate a 
> pre-vote where it ends up in the same epoch as before, but with a cleared 
> local leader id. 
> When r1 transitions to Prospective it clears its local leader id. When r1 
> receives a response from r2 who believes that r1 is still the leader, the 
> logic in MaybeHandleCommonResponse tries to transition r1 to follower of 
> itself, causing an IllegalState exception to be raised.
> This is an example history:
>  # r1 is the leader in epoch 1.
>  # r1 resigns because check quorum fails, or restarts and resigns.
>  # r1 experiences an election timeout and transitions to Prospective, 
> clearing its local leader id.
>  # r1 sends a pre-vote request to its peers.
>  # r2, which still believes r1 is the leader, sends a vote response that 
> does not grant its vote and sets leaderId=r1 and epoch=1.
>  # r1 receives the vote response and executes MaybeHandleCommonResponse, 
> which tries to transition r1 to Follower of itself, raising an IllegalState 
> exception.
> The relevant else if statement in MaybeHandleCommonResponse is here: 
> https://github.com/apache/kafka/blob/a26a1d847f1884a519561e7a4fb4cd13e051c824/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1538
> In the TLA+ specification, I fixed this issue by adding a fourth condition to 
> this statement, that the leaderId also does not equal this server's id. 
> [https://github.com/Vanlightly/kafka-tlaplus/blob/9b2600d1cd5c65930d666b12792d47362b64c015/kraft/kip_996/kraft_kip_996_functions.tla#L336]
> We should probably create a test to confirm the issue first and then look at 
> applying the fix I made in the TLA+ specification, though there may be other 
> options.
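
For illustration, here is a minimal Java sketch of the history and of the proposed fourth condition. It is a simplified model, not the real KafkaRaftClient: the class PreVoteGuardSketch, its LocalState type, and the method names existingCheck, fixedCheck and handleResponse are all hypothetical, and the three "existing" conditions (same epoch, leader id present in the response, no local leader) are an assumption based on a reading of the linked else-if branch.

```java
import java.util.OptionalInt;

/** Hypothetical, simplified model of the scenario; not the real KafkaRaftClient. */
final class PreVoteGuardSketch {

    /** Minimal stand-in for the local quorum state of one replica. */
    static final class LocalState {
        final int localId;
        final int epoch;
        // Empty once the replica has become Prospective and cleared its leader id.
        OptionalInt leaderId = OptionalInt.empty();

        LocalState(int localId, int epoch) {
            this.localId = localId;
            this.epoch = epoch;
        }

        /** Models the invariant from the report: a replica cannot follow itself. */
        void transitionToFollower(int newLeaderId) {
            if (newLeaderId == localId) {
                throw new IllegalStateException(
                    "Cannot become Follower of leader " + newLeaderId
                        + " because it is the local id");
            }
            leaderId = OptionalInt.of(newLeaderId);
        }
    }

    /** The three conditions assumed to be in the existing else-if branch. */
    static boolean existingCheck(LocalState s, int epoch, OptionalInt leaderId) {
        return epoch == s.epoch && leaderId.isPresent() && !s.leaderId.isPresent();
    }

    /** The same check plus the fourth condition: the leader is not this server. */
    static boolean fixedCheck(LocalState s, int epoch, OptionalInt leaderId) {
        return existingCheck(s, epoch, leaderId) && leaderId.getAsInt() != s.localId;
    }

    /** Applies a response; returns true if a Follower transition happened. */
    static boolean handleResponse(
            LocalState s, int epoch, OptionalInt leaderId, boolean useFix) {
        boolean shouldFollow = useFix
            ? fixedCheck(s, epoch, leaderId)
            : existingCheck(s, epoch, leaderId);
        if (shouldFollow) {
            s.transitionToFollower(leaderId.getAsInt());
        }
        return shouldFollow;
    }
}
```

With r1 modelled as new LocalState(1, 1), calling handleResponse(r1, 1, OptionalInt.of(1), false) follows the history above and throws IllegalStateException, while the same call with useFix set to true returns false and leaves r1's state untouched. A response naming a different leader in the same epoch still triggers the Follower transition under both checks.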



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
