[
https://issues.apache.org/jira/browse/KAFKA-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
José Armando García Sancio updated KAFKA-16927:
-----------------------------------------------
Fix Version/s: 3.9.0
> Adding tests for restarting followers receiving leader endpoint correctly
> -------------------------------------------------------------------------
>
> Key: KAFKA-16927
> URL: https://issues.apache.org/jira/browse/KAFKA-16927
> Project: Kafka
> Issue Type: Sub-task
> Components: kraft
> Reporter: Alyssa Huang
> Assignee: José Armando García Sancio
> Priority: Major
> Fix For: 3.9.0
>
>
> Restarting an inactive controller fails to start if the active leader has
> more endpoint than the latest voter set. The easiest to reproduce is with the
> following configuration
> {code:java}
> cat kafka.properties | grep controller.quorum
> controller.quorum.voters=0@controller-0:1234,1@controller-1:1234,2@controller-2:1234{code}
> This is what is executed in the QuorumState loading code:
> {code:java}
> if (leaderEndpoints.isEmpty()) {
> ...
> } else {
> initialState = new FollowerState(
> time,
> election.epoch(),
> election.leaderId(),
> leaderEndpoints,
> voters.voterIds(),
> Optional.empty(),
> fetchTimeoutMs,
> logContext
> );
> }{code}
> If the leader has two endpoints it will send the following BEGIN_QUORUM_EPOCH
> request:
> {code:java}
>
> "leaderEndpoints":[{"name":"CONTROLLER_PLAINTEXT","host":"controller-0","port":1234},{"name":"CONTROLLER","host":"controller-0","port":4321}]{code}
> And this code doesn't handle that correctly:
> {code:java}
> } else if (
> leaderId.isPresent() &&
> (!quorum.hasLeader() || leaderEndpoints.size() >
> quorum.leaderEndpoints().size())
> ) {
> // The request or response indicates the leader of the current
> epoch
> // which are currently unknown or the replica has discovered
> more endpoints
> transitionToFollower(epoch, leaderId.getAsInt(),
> leaderEndpoints, currentTimeMs);
> }{code}
> After adding a test for this, the test fails with the following:
> {code:java}
> Gradle Test Run :raft:test > Gradle Test Executor 44 > KafkaRaftClientTest >
> testHandleBeginQuorumRequestMoreEndpoints() FAILED
> java.lang.AssertionError: Assertion failed with an exception
> at
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:453)
> at
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393)
> at
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
> at
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:367)
> at
> org.apache.kafka.raft.RaftClientTestContext.pollUntil(RaftClientTestContext.java:617)
> at
> org.apache.kafka.raft.RaftClientTestContext.pollUntilResponse(RaftClientTestContext.java:624)
> at
> org.apache.kafka.raft.KafkaRaftClientTest.testHandleBeginQuorumRequestMoreEndpoints(KafkaRaftClientTest.java:993)
> Caused by:
> java.lang.IllegalStateException: Cannot transition to Follower with
> leader 829 and epoch 3 from state FollowerState(fetchTimeoutMs=50000,
> epoch=3, leader=829,
> leaderEndpoints=Endpoints(endpoints={ListenerName(LISTENER)=localhost/<unresolved>:10819}),
> voters=[828, 829], highWatermark=Optional.empty,
> fetchingSnapshot=Optional.empty)
> at
> org.apache.kafka.raft.QuorumState.transitionToFollower(QuorumState.java:480)
> at
> org.apache.kafka.raft.KafkaRaftClient.transitionToFollower(KafkaRaftClient.java:732)
> at
> org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:2434)
> at
> org.apache.kafka.raft.KafkaRaftClient.handleBeginQuorumEpochRequest(KafkaRaftClient.java:1018)
> at
> org.apache.kafka.raft.KafkaRaftClient.handleRequest(KafkaRaftClient.java:2565)
> at
> org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
> at
> org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3314)
> at
> org.apache.kafka.raft.RaftClientTestContext.lambda$pollUntil$1(RaftClientTestContext.java:618)
> at
> org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)
> at
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444)
> ... 6 more
> 1 test completed, 1 failed {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)