Tim Costa created KAFKA-13615:
---------------------------------

             Summary: Kafka Streams does not transition state on LeaveGroup due 
to poll interval being exceeded
                 Key: KAFKA-13615
                 URL: https://issues.apache.org/jira/browse/KAFKA-13615
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.8.1
            Reporter: Tim Costa


We are running a KafkaStreams application with largely default settings. 
Occasionally one of our consumers in the group takes too long between polls, 
and streams leaves the consumer group but the state of the application remains 
`RUNNING`. We are using the default `max.poll.interval.ms` of 5000.

The process stays alive with no exception that bubbles to our code, so when 
this occurs our app just kinda sits there idle until a manual restart is 
performed.

Here are the logs from around the time of the problem:
{code:java}
{"timestamp":"2022-01-24 
19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
 [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, 
and committed 400 total tasks since the last update","context":"default"} 
{"timestamp":"2022-01-24 
19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
 [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, 
and committed 400 total tasks since the last update","context":"default"} 
{"timestamp":"2022-01-24 
20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | 
stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer
 clientId=kubepodname-StreamThread-1-consumer, 
groupId=stage-us-1-fanout-logs-2c99] Member 
kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 
sending LeaveGroup request to coordinator 
b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to 
consumer poll timeout has expired. This means the time between subsequent calls 
to poll() was longer than the configured max.poll.interval.ms, which typically 
implies that the poll loop is spending too much time processing messages. You 
can address this either by increasing max.poll.interval.ms or by reducing the 
maximum size of batches returned in poll() with 
max.poll.records.","context":"default"} {code}
At this point the application entirely stops processing data. We initiated a 
shutdown by deleting the kubernetes pod, and the line printed immediately by 
kafka is the following:
{code:java}
{"timestamp":"2022-01-24 
20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread
 [kubepodname-StreamThread-1] State transition from RUNNING to 
PENDING_SHUTDOWN","context":"default"}
 {code}
For a period of over a minute the application was in a state of hiatus where it 
had left the group, however it was still marked as being in a `RUNNING` state 
so we had no way to detect that the application had entered a bad state to kill 
it in an automated fashion. While the above logs are from an app that we shut 
down within a minute or two manually, we have seen this stay in a bad state for 
up to an hour before.

It feels like a bug to me that the streams consumer can leave the consumer 
group but not exit the `RUNNING` state. I tried searching for other bugs like 
this, but couldn't find any. Any ideas on how to detect this, or thoughts on 
whether this is actually a bug?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to