Andrew Olson commented on KAFKA-8950:

We're currently testing 2.3.1 to see whether it resolves this issue.

> KafkaConsumer stops fetching
> ----------------------------
>                 Key: KAFKA-8950
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8950
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.3.0
>         Environment: linux
>            Reporter: Will James
>            Priority: Major
>             Fix For: 2.4.0, 2.3.1
> We have a KafkaConsumer consuming from a single partition with 
> enable.auto.commit set to true.
> Very occasionally, the consumer goes into a broken state. It returns no 
> records from the broker with every poll, and from most of the Kafka metrics 
> in the consumer it looks like it is fully caught up to the end of the log. 
> We see that we are long polling for the max poll timeout, and that there is 
> zero lag. In addition, we see that the heartbeat rate stays unchanged from 
> before the issue begins (so the consumer stays a part of the consumer group).
> In addition, from looking at the __consumer_offsets topic, it is possible to 
> see that the consumer is committing the same offset on the auto commit 
> interval; however, the offset does not move, and the lag from the broker's 
> perspective continues to increase.
> The issue is only resolved by restarting our application (which restarts the 
> KafkaConsumer instance).
> From a heap dump of an application in this state, I can see that the Fetcher 
> is in a state where it believes there are nodesWithPendingFetchRequests.
> However, I can see the state of the fetch latency sensor, specifically, the 
> fetch rate, and see that the samples were not updated for a long period of 
> time (actually, precisely the amount of time that the problem in our 
> application was occurring, around 50 hours - we have alerting on other 
> metrics but not the fetch rate, so we didn't notice the problem until a 
> customer complained).
> In this example, the consumer was processing around 40 messages per second, 
> with an average size of about 10 KB, although most of the other examples of 
> this have happened with higher volume (250 messages per second, around 23 KB 
> per message on average).
> I have spent some time investigating the issue on our end, and will continue 
> to do so as time allows; however, I wanted to raise this as an issue because 
> it may be affecting other people.
> Please let me know if you have any questions or need additional information. 
> I doubt I can provide heap dumps unfortunately, but I can provide further 
> information as needed.
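
The report above notes that the stall went unnoticed for around 50 hours because nothing alerted on the fetch rate. One possible application-side guard is sketched below. It is only an illustration under stated assumptions: FetchStallDetector is a hypothetical helper, not part of the Kafka client API, and the broker-side lag is assumed to be measured independently of the stuck consumer (for example, log end offset minus committed offset read through a separate client), since the consumer's own metrics report zero lag in this state.

```java
import java.time.Duration;

// Hypothetical helper (not part of the Kafka client API): flags a consumer
// whose polls keep returning no records while the broker still reports lag.
final class FetchStallDetector {
    private final long thresholdMs;
    private long lastRecordsAtMs;

    FetchStallDetector(Duration threshold, long nowMs) {
        this.thresholdMs = threshold.toMillis();
        this.lastRecordsAtMs = nowMs;
    }

    // Call after every poll with the number of records that poll returned.
    void onPoll(int recordCount, long nowMs) {
        if (recordCount > 0) {
            lastRecordsAtMs = nowMs;
        }
    }

    // brokerSideLag is assumed to come from a source other than the stuck
    // consumer's own metrics. Returns true when no records have arrived for
    // longer than the threshold while that lag is nonzero.
    boolean isStalled(long brokerSideLag, long nowMs) {
        return brokerSideLag > 0 && (nowMs - lastRecordsAtMs) > thresholdMs;
    }
}
```

In a poll loop, onPoll would be called with the size of each returned batch and System.currentTimeMillis(), and isStalled checked periodically. What to do on a positive result (raise an alert, or recreate the KafkaConsumer instance, which is what the restart in the report effectively did) is left to the application.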

This message was sent by Atlassian Jira
