[
https://issues.apache.org/jira/browse/KAFKA-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491609#comment-17491609
]
Tim Costa commented on KAFKA-13615:
-----------------------------------
[~guozhang] we can reliably reproduce this issue, to a point: it occurs every
time our application exceeds the max poll interval. The likely cause of the
interval being exceeded was a long-running HTTP request without a timeout,
which we have since corrected, and I don't believe we have seen the problem
since then. Unfortunately I cannot try a newer version of Kafka: we're on AWS
MSK, running the latest version it offers (2.8.1) and the matching client
library version.
Sorry I can't be more helpful here; I wish I had more information to provide.
It is good to know, though, that this type of behavior has been reported in the
past. We were worried it was something odd in our own code triggering it.
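In case it helps anyone else who hits the same root cause, below is a minimal
sketch of that kind of correction, assuming a JDK 11 `java.net.http.HttpClient`;
the class name, URL handling, and timeout values are illustrative only and not
our production code:
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BoundedHttpCall {
    // Keep connect and request timeouts well below max.poll.interval.ms so a
    // slow upstream service cannot stall the Streams poll loop indefinitely.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // hard cap on the whole request
                .GET()
                .build();
        // Throws java.net.http.HttpTimeoutException when the cap is exceeded,
        // so the failure surfaces in the processor instead of blocking poll().
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
{code}
The point is simply that any blocking call made from the processing path should
be bounded well below `max.poll.interval.ms`.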
> Kafka Streams does not transition state on LeaveGroup due to poll interval
> being exceeded
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-13615
> URL: https://issues.apache.org/jira/browse/KAFKA-13615
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.8.1
> Reporter: Tim Costa
> Priority: Major
>
> We are running a Kafka Streams application with largely default settings.
> Occasionally one of our consumers in the group takes too long between polls,
> Streams leaves the consumer group, but the state of the application remains
> `RUNNING`. We are using the default `max.poll.interval.ms` of 300000 (5 minutes).
> The process stays alive with no exception bubbling up to our code, so when this
> occurs the app simply sits idle until we restart it manually.
> Here are the logs from around the time of the problem:
> {code:java}
> {"timestamp":"2022-01-24 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer clientId=kubepodname-StreamThread-1-consumer, groupId=stage-us-1-fanout-logs-2c99] Member kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 sending LeaveGroup request to coordinator b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","context":"default"}
> {code}
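> As an aside, the tuning that the broker message suggests maps onto the Streams
> configuration roughly as sketched below; the class name and values are purely
> illustrative, not settings recommended by this ticket:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.streams.StreamsConfig;
> 
> public class PollTuningSketch {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         // Give the poll loop more headroom between poll() calls on the main consumer...
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
>         // ...and/or reduce how many records each poll() hands back to the loop.
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 250);
>         // These would be merged into the application's existing StreamsConfig properties.
>         System.out.println(props);
>     }
> }
> {code}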
> At this point the application stops processing data entirely. We initiated a
> shutdown by deleting the Kubernetes pod, and the first line Kafka printed after
> the Spring Boot shutdown-initiation logs is the following:
> {code:java}
> {"timestamp":"2022-01-24 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN","context":"default"}
> {code}
> For over a minute the application was in limbo: it had left the group, yet it
> was still reported as being in the `RUNNING` state, so we had no way to detect
> automatically that it had entered a bad state and kill it. While the logs above
> are from an instance we shut down manually within a minute or two, we have seen
> this bad state persist for up to an hour.
> It feels like a bug to me that the Streams consumer can leave the consumer
> group without exiting the `RUNNING` state. I searched for other reports like
> this but couldn't find any. Any ideas on how to detect this, or thoughts on
> whether it is actually a bug?
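> One detection approach might be to cross-check the consumer group out of band:
> if Streams still reports `RUNNING` but the group has no live members, treat the
> instance as unhealthy and let the orchestrator restart it. A rough sketch of
> that check follows; it only covers the single-instance case, and the class name
> and bootstrap address are placeholders, not code from our application:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import java.util.concurrent.ExecutionException;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.ConsumerGroupDescription;
> import org.apache.kafka.streams.KafkaStreams;
> 
> public class GroupMembershipProbe {
> 
>     // Flags the situation from this ticket: Streams reports RUNNING while the
>     // consumer group no longer has any members.
>     static boolean looksHealthy(KafkaStreams streams, Admin admin, String groupId)
>             throws ExecutionException, InterruptedException {
>         if (streams.state() != KafkaStreams.State.RUNNING) {
>             return true; // any other state is handled by the normal state machine
>         }
>         ConsumerGroupDescription group = admin
>                 .describeConsumerGroups(Collections.singletonList(groupId))
>                 .describedGroups()
>                 .get(groupId)
>                 .get();
>         return !group.members().isEmpty();
>     }
> 
>     public static void main(String[] args) throws Exception {
>         Properties adminProps = new Properties();
>         adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
>         try (Admin admin = Admin.create(adminProps)) {
>             // `streams` would be the application's running KafkaStreams instance;
>             // wire looksHealthy(streams, admin, "<application.id>") into a liveness probe.
>         }
>     }
> }
> {code}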