[ 
https://issues.apache.org/jira/browse/FLINK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453076#comment-17453076
 ] 

Fabian Paul commented on FLINK-25040:
-------------------------------------

The problem is caused by an issue with the internal ZooKeeper used by the Kafka cluster in the test container:
{code:java}
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at kafka.controller.ZkPartitionStateMachine.logFailedStateChange(PartitionStateMachine.scala:508)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$initializeLeaderAndIsrForPartitions$10(PartitionStateMachine.scala:314)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.initializeLeaderAndIsrForPartitions(PartitionStateMachine.scala:304)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:225)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
    at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:542)
    at kafka.controller.KafkaController.processTopicChange(KafkaController.scala:1497)
    at kafka.controller.KafkaController.process(KafkaController.scala:1906)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}
Interestingly, ZooKeeper was already complaining about the latency of its I/O calls before the error happened:
{code:java}
WARN fsync-ing the write ahead log in SyncThread:0 took 5084ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
{code}
Let's see whether this problem happens more frequently; if so, we might need to switch to a more resilient ZooKeeper setup or use the Kafka quorum (KRaft) for broker leader election, as sketched below.
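
For context on the second option: in KRaft mode the controller quorum replaces ZooKeeper entirely, so this class of failure cannot occur in the test container. Below is a minimal sketch, not the current test code, assuming a Testcontainers release whose KafkaContainer supports KRaft via withKraft() and a sufficiently new Confluent image tag (both are assumptions here):
{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.util.Properties;

public class KraftKafkaContainerSketch {

    public static void main(String[] args) {
        // Start the broker in KRaft mode: no embedded ZooKeeper, so neither the
        // NodeExistsException nor the fsync latency warning seen above can occur.
        // The image tag is an assumption for illustration only.
        try (KafkaContainer kafka =
                new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
                        .withKraft()) {
            kafka.start();

            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transaction");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // InitProducerId is now handled by the KRaft controller quorum.
                producer.initTransactions();
            }
        }
    }
}
{code}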

> FlinkKafkaInternalProducerITCase.testInitTransactionId failed on AZP
> --------------------------------------------------------------------
>
>                 Key: FLINK-25040
>                 URL: https://issues.apache.org/jira/browse/FLINK-25040
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kafka
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Assignee: Fabian Paul
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.14.1
>
>
> The test {{FlinkKafkaInternalProducerITCase.testInitTransactionId}} failed on 
> AZP with:
> {code}
> Nov 24 09:25:41 [ERROR] org.apache.flink.connector.kafka.sink.FlinkKafkaInternalProducerITCase.testInitTransactionId  Time elapsed: 82.766 s  <<< ERROR!
> Nov 24 09:25:41 org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000 milliseconds while awaiting InitProducerId
> Nov 24 09:25:41 
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26987&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=15a22db7-8faa-5b34-3920-d33c9f0ca23c&l=6726



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
