[
https://issues.apache.org/jira/browse/KAFKA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Chen resolved KAFKA-15052.
-------------------------------
Resolution: Fixed
> Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()
> -----------------------------------------------------------------
>
> Key: KAFKA-15052
> URL: https://issues.apache.org/jira/browse/KAFKA-15052
> Project: Kafka
> Issue Type: Test
> Reporter: Dimitar Dimitrov
> Assignee: Dimitar Dimitrov
> Priority: Major
> Labels: flaky-test
> Fix For: 3.6.0
>
>
> Test failed at
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/]
> as well as in various local runs.
> The test creates a topic, fences a broker, notes partition imbalance due to
> another broker taking over the partition the fenced broker lost, re-registers
> and unfences the fenced broker, sends {{AlterPartition}} for the lost
> partition adding the now unfenced broker back to its ISR, then waits for the
> partition imbalance to disappear.
> The local failures seem to happen when the brokers (including the ones that
> never get fenced by the test) accidentally get fenced by losing their session
> due to reaching the (aggressively low for test purposes) session timeout.
> The Cloudbees failure quoted above also seems to indicate that this happened:
> {code:java}
> ...[truncated 738209 chars]...
> 23. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write
> event for maybeBalancePartitionLeaders because scheduled (DEFERRED),
> checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true)
> (org.apache.kafka.controller.QuorumController:1401)
> [2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-0, foo-1, foo-2
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for
> foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch:
> 3 -> 4, partitionEpoch: 4 -> 5
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for
> foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2
> -> 3, partitionEpoch: 2 -> 3
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1,
> appendTimestamp=240,
> records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2,
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null,
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at
> version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2,
> brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]),
> prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory
> snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log
> check. (org.apache.kafka.metalog.LocalLogManager:512)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation
> maybeFenceReplicas(451616131) will be completed when the log reaches offset
> 27. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced:
> changing partition(s): foo-1
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4
> -> 5, partitionEpoch: 5 -> 6
> (org.apache.kafka.controller.ReplicationControlManager:157){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)