[ https://issues.apache.org/jira/browse/FLINK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453076#comment-17453076 ]
Fabian Paul commented on FLINK-25040:
-------------------------------------
The failure is caused by an issue with the internal ZooKeeper used by the
Kafka cluster in the test container:
{code:java}
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at kafka.controller.ZkPartitionStateMachine.logFailedStateChange(PartitionStateMachine.scala:508)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$initializeLeaderAndIsrForPartitions$10(PartitionStateMachine.scala:314)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.initializeLeaderAndIsrForPartitions(PartitionStateMachine.scala:304)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:225)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
    at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:542)
    at kafka.controller.KafkaController.processTopicChange(KafkaController.scala:1497)
    at kafka.controller.KafkaController.process(KafkaController.scala:1906)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}
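For context, a rough sketch of how such a test broker is usually brought up (assuming Testcontainers' KafkaContainer, which by default launches a single-node ZooKeeper inside the same container; the exact wiring in the ITCase may differ):
{code:java}
// Hedged sketch, not the exact test setup: Testcontainers' KafkaContainer runs one
// Kafka broker plus an embedded single-node ZooKeeper in the same container.
// That embedded ZooKeeper is the "internal zookeeper" hitting the error above.
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KafkaTestContainerSketch {
    public static void main(String[] args) {
        try (KafkaContainer kafka =
                new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:6.2.1"))) {
            kafka.start();
            // The test's producers/consumers connect via the advertised bootstrap servers.
            System.out.println(kafka.getBootstrapServers());
        }
    }
}
{code}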
Interestingly, ZooKeeper was already complaining about the latency of its I/O
calls before the error happened:
{code:java}
WARN fsync-ing the write ahead log in SyncThread:0 took 5084ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
{code}
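The warning fires when a transaction-log fsync exceeds ZooKeeper's threshold (system property zookeeper.fsync.warningthresholdms, 1000 ms by default), so a 5 s fsync points at an overloaded CI disk rather than at the test itself. Purely as a hedged illustration (it only quiets the symptom and does not fix the slow I/O), the threshold could be raised on the embedded ZooKeeper, assuming it is started via Kafka's zookeeper-server-start script, which forwards KAFKA_OPTS to the JVM:
{code:java}
// Hedged sketch only: raise ZooKeeper's fsync warning threshold from 1 s to 10 s.
// Assumes the embedded ZooKeeper inside the test container is launched through
// zookeeper-server-start.sh, which passes KAFKA_OPTS on to the ZooKeeper JVM.
import org.testcontainers.containers.KafkaContainer;

public class FsyncThresholdSketch {
    static KafkaContainer withRelaxedFsyncWarning(KafkaContainer kafka) {
        return kafka.withEnv("KAFKA_OPTS", "-Dzookeeper.fsync.warningthresholdms=10000");
    }
}
{code}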
Let's see whether this problem happens more frequently; if it does, we might need
to switch to a more resilient ZooKeeper setup or use the Kafka quorum (KRaft) for
broker leader election.
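A hedged sketch of the second option (not what was actually done for this ticket), assuming a Testcontainers release that supports KafkaContainer.withKraft() and a Confluent Platform image >= 7.0.0:
{code:java}
// Hedged sketch: run the test broker in KRaft mode so that no embedded ZooKeeper
// (and no ZooKeeper fsync stall) is involved in controller and leader election.
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KraftKafkaSketch {
    public static void main(String[] args) {
        try (KafkaContainer kafka =
                new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
                        .withKraft()) {
            kafka.start();
            System.out.println(kafka.getBootstrapServers());
        }
    }
}
{code}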
> FlinkKafkaInternalProducerITCase.testInitTransactionId failed on AZP
> --------------------------------------------------------------------
>
> Key: FLINK-25040
> URL: https://issues.apache.org/jira/browse/FLINK-25040
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kafka
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Assignee: Fabian Paul
> Priority: Critical
> Labels: test-stability
> Fix For: 1.14.1
>
>
> The test {{FlinkKafkaInternalProducerITCase.testInitTransactionId}} failed on
> AZP with:
> {code}
> Nov 24 09:25:41 [ERROR] org.apache.flink.connector.kafka.sink.FlinkKafkaInternalProducerITCase.testInitTransactionId  Time elapsed: 82.766 s <<< ERROR!
> Nov 24 09:25:41 org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000 milliseconds while awaiting InitProducerId
> Nov 24 09:25:41
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26987&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=15a22db7-8faa-5b34-3920-d33c9f0ca23c&l=6726