[ https://issues.apache.org/jira/browse/KAFKA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736453#comment-17736453 ]

Dimitar Dimitrov commented on KAFKA-15052:
------------------------------------------

After a good number of not-very-successful local attempts to reproduce this, I 
have some evidence that the session timeout bump indeed addresses the issue. 
Although the testing doesn't seem very reliable, since reproducing the issue 
isn't reliable to begin with, I've posted https://github.com/apache/kafka/pull/13908.
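
For context, the change is essentially just a bump of the session timeout used by 
the test's controller environment, so that healthy brokers stop losing their 
sessions mid-test. A minimal sketch of that kind of change (the builder and method 
names below are my assumptions modelled on how the other QuorumController tests 
are set up, not the exact PR diff):
{code:java}
// Sketch only: names are assumptions, not the exact PR 13908 change.
try (
    LocalLogManagerTestEnv logEnv = new LocalLogManagerTestEnv.Builder(1).build();
    QuorumControllerTestEnv controlEnv = new QuorumControllerTestEnv.Builder(logEnv)
        // A 1-second session timeout is easy to overshoot on a loaded CI
        // machine; a larger value keeps brokers that the test never fences
        // from being fenced behind our backs while the partition rebalances.
        .setSessionTimeoutMillis(OptionalLong.of(15_000))
        .build()
) {
    // ... unchanged test body: register and unfence the brokers, create the
    // topic, fence one broker, unfence it again, send AlterPartition, then
    // wait for the partition imbalance to disappear ...
}
{code}
The exact value matters less than keeping it comfortably above the heartbeat 
cadence the test drives.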

> Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15052
>             Project: Kafka
>          Issue Type: Test
>            Reporter: Dimitar Dimitrov
>            Assignee: Dimitar Dimitrov
>            Priority: Major
>              Labels: flaky-test
>
> Test failed at 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/]
>  as well as in various local runs.
> The test creates a topic, fences a broker, notes partition imbalance due to 
> another broker taking over the partition the fenced broker lost, re-registers 
> and unfences the fenced broker, sends {{AlterPartition}} for the lost 
> partition adding the now unfenced broker back to its ISR, then waits for the 
> partition imbalance to disappear.
> The local failures seem to happen when brokers (including the ones the test 
> never intends to fence) accidentally get fenced because their sessions expire 
> against the (aggressively low, for test purposes) session timeout.
> The Cloudbees failure linked above also seems to indicate that this happened 
> (a small illustration of the mechanism follows the log excerpt below):
> {code:java}
> ...[truncated 738209 chars]...
> 23. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write 
> event for maybeBalancePartitionLeaders because scheduled (DEFERRED), 
> checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true) 
> (org.apache.kafka.controller.QuorumController:1401)
> [2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2 
> because its session has timed out. 
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced: 
> changing partition(s): foo-0, foo-1, foo-2 
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for 
> foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 
> -> 3, partitionEpoch: 2 -> 3 
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for 
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch: 
> 3 -> 4, partitionEpoch: 4 -> 5 
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for 
> foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 
> -> 3, partitionEpoch: 2 -> 3 
> (org.apache.kafka.controller.ReplicationControlManager:157)
> [2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1, 
> appendTimestamp=240, 
> records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, 
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, 
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at 
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1, 
> topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null, 
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at 
> version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2, 
> topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, 
> removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at 
> version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2, 
> brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]), 
> prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory 
> snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
> [2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log 
> check. (org.apache.kafka.metalog.LocalLogManager:512)
> [2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation 
> maybeFenceReplicas(451616131) will be completed when the log reaches offset 
> 27. (org.apache.kafka.controller.QuorumController:768)
> [2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3 
> because its session has timed out. 
> (org.apache.kafka.controller.ReplicationControlManager:1459)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced: 
> changing partition(s): foo-1 
> (org.apache.kafka.controller.ReplicationControlManager:1750)
> [2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for 
> foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4 
> -> 5, partitionEpoch: 5 -> 6 
> (org.apache.kafka.controller.ReplicationControlManager:157){code}
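
For anyone reading the log above: the mechanism is simply that the controller 
fences any broker whose session has expired, so with an aggressively low timeout 
even brokers the test never touches get fenced between two heartbeats on a slow 
machine. A self-contained illustration of that decision (made-up names, not 
Kafka's actual session-tracking code):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the flaky-fencing mechanism; not Kafka code.
public class SessionTimeoutSketch {
    // Aggressively low session timeout, like the one the test configures.
    static final long SESSION_TIMEOUT_NS = 1_000_000_000L; // 1 second

    // brokerId -> nanoTime of the last heartbeat the controller saw
    static final Map<Integer, Long> lastHeartbeatNs = new HashMap<>();

    static void heartbeat(int brokerId, long nowNs) {
        lastHeartbeatNs.put(brokerId, nowNs);
    }

    // The same decision as "Fencing broker N because its session has timed
    // out": a broker whose last heartbeat is older than the timeout gets
    // fenced, whether or not the test meant to fence it.
    static boolean shouldFence(int brokerId, long nowNs) {
        Long last = lastHeartbeatNs.get(brokerId);
        return last == null || nowNs - last > SESSION_TIMEOUT_NS;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        heartbeat(2, start);
        heartbeat(3, start);
        // Simulate a slow CI machine taking longer than the session timeout
        // before the brokers heartbeat again.
        Thread.sleep(1500);
        long now = System.nanoTime();
        System.out.println("fence broker 2? " + shouldFence(2, now)); // true
        System.out.println("fence broker 3? " + shouldFence(3, now)); // true
    }
}
{code}
That matches what the trace shows: brokers 2 and 3 are fenced purely because 
their sessions timed out, which is why raising the timeout (rather than changing 
the test logic) is the fix being tried.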


