[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ]
Tim Van Laer commented on KAFKA-8803:
-------------------------------------

I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with:
{code}
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-0000000000, topic=galactica.timeline-aligner.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
	at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?]
	at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?]
	at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?]
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?]
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?]
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
{code}
The instance was automatically restarted every time, but it kept failing, even after stopping the whole group. Yesterday two brokers threw an UNKNOWN_LEADER_EPOCH error, and after that the client started running into trouble:
{code}
[2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
{code}
{code}
[2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
{code}
Meta:
* Kafka Streams 2.3.1
* Broker: patched 2.3.1 without KAFKA-8724 (see KAFKA-9133)

I will give {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers.
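For reference, a minimal sketch of how the producer-level {{max.block.ms}} could be raised through the Streams config; the application id, bootstrap servers, and the 5-minute value below are placeholders, not our actual settings:
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class MaxBlockMsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder application id and bootstrap servers.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "timeline-aligner");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // Exactly-once processing is what makes Streams call InitProducerId
        // (via initTransactions) when a task starts up.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // producerPrefix() scopes the setting to the producers that Streams
        // creates internally; 300000 ms (5 minutes) is an arbitrary example.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 300000);

        // ... build the topology and start KafkaStreams with these props ...
    }
}
{code}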
> Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8803
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8803
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Raman Gupta
>            Priority: Major
>         Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] org.apa.kaf.str.pro.int.StreamTask : task [0_36] Timeout exception caught when initializing transactions for task 0_36. This might happen if the broker is slow to respond, if the network connection to the broker was interrupted, or if similar circumstances arise. You can increase producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, including some in the very same processes as the stream that consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error was August 13th 2019, 17:03:36.754, and it happened for 4 different streams. For 3 of these streams, the error only happened once, and then the stream recovered. For the 4th stream, the error has continued to happen, and continues to happen now.
> I looked up the broker logs for this time and see that at August 13th 2019, 16:47:43, two of the four brokers started reporting messages like this, for multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time and then stopped. Here is a view of the count of these messages over time:
> !screenshot-1.png!
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 brokers. The broker has a patch for KAFKA-8773.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)