Tim Costa created KAFKA-13615:
---------------------------------
Summary: Kafka Streams does not transition state on LeaveGroup due
to poll interval being exceeded
Key: KAFKA-13615
URL: https://issues.apache.org/jira/browse/KAFKA-13615
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 2.8.1
Reporter: Tim Costa
We are running a KafkaStreams application with largely default settings.
Occasionally one of the consumers in the group takes too long between polls,
and Streams leaves the consumer group, but the state of the application remains
`RUNNING`. We are using the default `max.poll.interval.ms` of 300000 (5 minutes).
The process stays alive with no exception bubbling up to our code, so when this
happens the application simply sits idle until a manual restart is performed.
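For context, here is a minimal sketch of how the two settings the coordinator log below points at could be tuned through `StreamsConfig` (the application id and broker address are placeholders, not our real values); this only works around the poll timeout and does not address the missing state transition:
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class PollTuningSketch {

    // Builds Streams properties that override the two poll-related consumer
    // settings mentioned in the coordinator's LeaveGroup message.
    // Application id and bootstrap servers are placeholders.
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9096");

        // Allow more time between successive poll() calls before the member
        // is kicked out of the group (consumer default is 300000 ms = 5 minutes).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);

        // Hand back fewer records per poll() so one iteration of the poll
        // loop does less work.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);

        return props;
    }
}
{code}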
Here are the logs from around the time of the problem:
{code:java}
{"timestamp":"2022-01-24
19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
[kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators,
and committed 400 total tasks since the last update","context":"default"}
{"timestamp":"2022-01-24
19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
[kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators,
and committed 400 total tasks since the last update","context":"default"}
{"timestamp":"2022-01-24
20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread |
stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer
clientId=kubepodname-StreamThread-1-consumer,
groupId=stage-us-1-fanout-logs-2c99] Member
kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5
sending LeaveGroup request to coordinator
b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to
consumer poll timeout has expired. This means the time between subsequent calls
to poll() was longer than the configured max.poll.interval.ms, which typically
implies that the poll loop is spending too much time processing messages. You
can address this either by increasing max.poll.interval.ms or by reducing the
maximum size of batches returned in poll() with
max.poll.records.","context":"default"} {code}
At this point the application stops processing data entirely. We initiated a
shutdown by deleting the Kubernetes pod, and the line Kafka printed immediately
afterwards was the following:
{code:java}
{"timestamp":"2022-01-24
20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
[kubepodname-StreamThread-1] State transition from RUNNING to
PENDING_SHUTDOWN","context":"default"}
{code}
For over a minute the application was in limbo: it had left the group, yet it
was still reported as being in the `RUNNING` state, so we had no way to detect
that it had entered a bad state and kill it in an automated fashion. While the
logs above are from an instance we shut down manually within a minute or two,
we have previously seen this bad state persist for up to an hour.
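To make the detection problem concrete, this is roughly the kind of health check we would like to rely on (the names and the probe wiring are simplified placeholders); because `KafkaStreams#state()` keeps returning `RUNNING` after the LeaveGroup, a check like this never fails:
{code:java}
import org.apache.kafka.streams.KafkaStreams;

public class StreamsHealthCheck {

    private final KafkaStreams streams;

    public StreamsHealthCheck(KafkaStreams streams) {
        this.streams = streams;
    }

    // Intended to back a Kubernetes liveness probe: report healthy only while
    // the instance is RUNNING or REBALANCING. In the scenario described above,
    // state() still returns RUNNING after the consumer has left the group,
    // so this check keeps passing and the pod is never restarted.
    public boolean isHealthy() {
        KafkaStreams.State state = streams.state();
        return state == KafkaStreams.State.RUNNING || state == KafkaStreams.State.REBALANCING;
    }
}
{code}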
It feels like a bug to me that the Streams consumer can leave the consumer
group while the application never exits the `RUNNING` state. I searched for
existing reports of this but couldn't find any. Any ideas on how to detect
this, or thoughts on whether this is actually a bug?
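One stopgap we are considering, but have not validated, is watching the consumer's `last-poll-seconds-ago` metric through `KafkaStreams#metrics()` and alerting once it exceeds `max.poll.interval.ms`; a rough sketch of that idea (assuming the metric is exposed through the Streams metrics map):
{code:java}
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PollAgeCheck {

    // Scans the client metrics exposed by KafkaStreams for the consumers'
    // "last-poll-seconds-ago" metric and returns the largest value seen.
    // A monitoring job could alert, or restart the pod, once this exceeds
    // the configured max.poll.interval.ms.
    public static double maxSecondsSinceLastPoll(KafkaStreams streams) {
        double max = 0.0;
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            if ("last-poll-seconds-ago".equals(entry.getKey().name())) {
                Object value = entry.getValue().metricValue();
                if (value instanceof Number) {
                    max = Math.max(max, ((Number) value).doubleValue());
                }
            }
        }
        return max;
    }
}
{code}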