[
https://issues.apache.org/jira/browse/KAFKA-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487852#comment-17487852
]
[~timcosta] Thanks for reporting this issue, what you described seems to be
related to a known issue in the older versions --- I tried to dig it out but
could not by just search keywords, sigh.. maybe I need to refresh my memory and
try again some time later --- anyways, could deterministically reproduce this
issue, and if yes could you try to use a newer version of Kafka and see if the
reproducing procedure still triggers it?
> Kafka Streams does not transition state on LeaveGroup due to poll interval
> being exceeded
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-13615
> URL: https://issues.apache.org/jira/browse/KAFKA-13615
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.8.1
> Reporter: Tim Costa
> Priority: Major
>
> We are running a KafkaStreams application with largely default settings.
> Occasionally one of our consumers in the group takes too long between polls,
> and Streams leaves the consumer group, but the state of the application
> remains `RUNNING`. We are using the default `max.poll.interval.ms` of 300000
> (5 minutes). The process stays alive with no exception bubbling up to our
> code, so when this occurs the app just sits there idle until a manual restart
> is performed.
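> The broker-side log below suggests the usual two knobs for this. For
> reference, here is a minimal sketch of how we could tune them; the values,
> topic and bootstrap strings are illustrative only, not what we actually run:
> {code:java}
> import java.util.Properties;
>
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.streams.StreamsConfig;
>
> Properties props = new Properties();
> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stage-us-1-fanout-logs-2c99");
> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "b-2.***.kafka.us-east-1.amazonaws.com:9096");
> // Allow more time between poll() calls before the consumer is considered dead
> // and leaves the group (the default is 300000 ms, i.e. 5 minutes).
> props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);
> // Return smaller batches from poll() so each processing loop finishes sooner
> // (Streams overrides the plain-consumer default here).
> props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
> {code}
> Neither setting explains the state-transition behaviour described below,
> though; they only reduce how often we hit it.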
> Here are the logs from around the time of the problem:
> {code:java}
> {"timestamp":"2022-01-24
> 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
> [kubepodname-StreamThread-1] Processed 65296 total records, ran 0
> punctuators, and committed 400 total tasks since the last
> update","context":"default"} {"timestamp":"2022-01-24
> 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
> [kubepodname-StreamThread-1] Processed 65284 total records, ran 0
> punctuators, and committed 400 total tasks since the last
> update","context":"default"}
> {"timestamp":"2022-01-24
> 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread |
> stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer
> clientId=kubepodname-StreamThread-1-consumer,
> groupId=stage-us-1-fanout-logs-2c99] Member
> kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5
> sending LeaveGroup request to coordinator
> b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to
> consumer poll timeout has expired. This means the time between subsequent
> calls to poll() was longer than the configured max.poll.interval.ms, which
> typically implies that the poll loop is spending too much time processing
> messages. You can address this either by increasing max.poll.interval.ms or
> by reducing the maximum size of batches returned in poll() with
> max.poll.records.","context":"default"} {code}
> At this point the application entirely stops processing data. We initiated a
> shutdown by deleting the Kubernetes pod, and the first line printed by Kafka
> after the Spring Boot shutdown-initiation logs is the following:
> {code:java}
> {"timestamp":"2022-01-24
> 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
> [kubepodname-StreamThread-1] State transition from RUNNING to
> PENDING_SHUTDOWN","context":"default"}
> {code}
> For over a minute the application was in a hiatus where it had left the
> group, yet it was still marked as being in the `RUNNING` state, so we had no
> way to detect that it had entered a bad state and kill it in an automated
> fashion. While the above logs are from an app that we shut down manually
> within a minute or two, we have seen this stay in a bad state for up to an
> hour before.
> It feels like a bug to me that the Streams consumer can leave the consumer
> group without exiting the `RUNNING` state. I tried searching for other bugs
> like this, but couldn't find any. Any ideas on how to detect this, or
> thoughts on whether this is actually a bug?
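> For context on what I tried: here is a minimal sketch of the kind of
> detection hook I was hoping for (the topology, `buildStreamsProperties()` and
> `markUnhealthy()` are hypothetical placeholders). A state listener only fires
> on transitions, and a liveness probe that polls `KafkaStreams#state()` still
> sees `RUNNING`, so neither catches this situation:
> {code:java}
> import java.util.Properties;
>
> import org.apache.kafka.streams.KafkaStreams;
> import org.apache.kafka.streams.StreamsBuilder;
>
> Properties props = buildStreamsProperties(); // hypothetical helper returning our Streams config
> StreamsBuilder builder = new StreamsBuilder();
> builder.stream("input-topic").to("output-topic"); // placeholder topology
>
> KafkaStreams streams = new KafkaStreams(builder.build(), props);
> streams.setStateListener((newState, oldState) -> {
>     // Never invoked in the scenario above, because the client stays in RUNNING
>     // even after the consumer has left the group.
>     if (newState == KafkaStreams.State.ERROR || newState == KafkaStreams.State.NOT_RUNNING) {
>         markUnhealthy(); // hypothetical hook, e.g. fail the Kubernetes liveness probe
>     }
> });
> streams.start();
>
> // Polling the state from a liveness endpoint has the same problem: it still
> // reports RUNNING after the LeaveGroup shown in the logs above.
> boolean healthy = streams.state().isRunningOrRebalancing();
> {code}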