[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920004 ]

Raman Gupta commented on KAFKA-8803:
------------------------------------

[~bbejeck] If you want to close it, go ahead; however, I don't consider a 
stream taking 17 days to recover, under the default settings, to be normal in 
any situation. Furthermore, the documentation for `max.block.ms` does not 
cover this situation at all. It says:

> These methods can be blocked either because the buffer is full or metadata 
> unavailable.

Neither of these was true in this situation. Furthermore, the error message 
says: "This might happen if the broker is slow to respond, if the network 
connection to the broker was interrupted, or if similar circumstances arise." 
Note that these circumstances refer explicitly to performance and networking 
problems, and say nothing about the broker's state for this particular stream 
being the cause.

Furthermore, I still don't see why the brokers would continue to experience 
the same UNKNOWN_LEADER_EPOCH error over the course of 17 days. Shouldn't the 
brokers recover on their own, and the stream successfully reconnect once they 
do? Any situation in which the client somehow causes this error to keep 
happening for 17 days is, in my opinion, a bug (especially given that I even 
turned this stream off for about 6 of those 17 days, and the brokers still 
did not recover during that period).
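
For what it's worth, the InitProducerId round trip is itself retriable on the 
client side: TimeoutException is a retriable exception, so a plain 
transactional producer can simply call initTransactions() again, which is 
effectively what the stream does on every restart. A minimal sketch (the 
bootstrap server and transactional id below are placeholders, not values from 
this issue):

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringSerializer;

public class InitProducerIdRetry {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection settings, not values from this issue.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            int attempts = 0;
            while (true) {
                try {
                    // Blocks for up to max.block.ms awaiting InitProducerId.
                    producer.initTransactions();
                    break;
                } catch (TimeoutException e) {
                    // TimeoutException is retriable; give up after a few tries.
                    if (++attempts >= 5) {
                        throw e;
                    }
                }
            }
        }
    }
}
{code}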

Given all that, it seems to me there is still a lot of unexplained behavior 
here, and it doesn't make sense to me to close the issue.
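
In the meantime, the only mitigation the error message itself offers is 
raising `max.block.ms` on the producer. For anyone else hitting this, 
forwarding that setting to the producers Kafka Streams creates internally 
looks roughly like the sketch below (the application id and bootstrap servers 
are placeholders):

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutWorkaround {

    static Properties streamsConfig() {
        Properties props = new Properties();
        // Placeholder values, not from this issue.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // Transactions (and hence InitProducerId) are only in play with
        // exactly-once processing enabled.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // producerPrefix() scopes the setting to the internally created producers;
        // max.block.ms defaults to 60000 ms, matching the timeout in the exception.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), "120000");
        return props;
    }
}
{code}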

> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8803
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8803
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Raman Gupta
>            Priority: Major
>         Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask                : task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some running in the very same processes as the stream that 
> consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error was at August 13th 2019, 
> 17:03:36.754, and it happened for 4 different streams. For 3 of these 
> streams, the error happened only once, and then the stream recovered. For 
> the 4th stream, the error has continued to happen ever since, and is still 
> happening now.
> I looked up the broker logs for this time, and saw that at August 13th 
> 2019, 16:47:43, two of the four brokers started reporting messages like 
> this, for multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader 
> reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time and then stopped; 
> here is a view of the count of these messages over time:
>  !screenshot-1.png! 
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 
> 2.3.0 brokers. The brokers have a patch for KAFKA-8773.


