[
https://issues.apache.org/jira/browse/KAFKA-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964526#comment-16964526
]
amuthan Ganeshan commented on KAFKA-9073:
-----------------------------------------
I am more interested in understanding the root cause. [~guozhang], could you
please explain the scenario? Also, will this bug fix be included in a minor
release under 2.3?
> Kafka Streams State stuck in rebalancing after one of the StreamThread
> encounters java.lang.IllegalStateException: No current assignment for
> partition
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-9073
> URL: https://issues.apache.org/jira/browse/KAFKA-9073
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.3.0
> Reporter: amuthan Ganeshan
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: KAFKA-9073.log
>
>
> I have a Kafka stream application that stores the incoming messages into a
> state store, and later during the punctuation period, we store them into a
> big data persistent store after processing the messages.
> The application consumes from 120 partitions distributed across 40 instances.
> The application has been running fine without any problem for months, but all
> of a sudden some of the instances failed because of a stream thread exception
> saying
> ```java.lang.IllegalStateException: No current assignment for partition
> <app_name>-<store_name>-changelog-98```
>
> The other instances are stuck in the REBALANCING state and never come out
> of it. Here is the full stack trace; I only masked the application-specific
> app name and store name in the stack trace due to NDA.
>
> ```
> 2019-10-21 13:27:13,481 ERROR [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] [org.apache.kafka.streams.processor.internals.StreamThread] [] stream-thread [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] Encountered the following error during processing:
> java.lang.IllegalStateException: No current assignment for partition application.id-store_name-changelog-98
>     at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:319)
>     at org.apache.kafka.clients.consumer.internals.SubscriptionState.requestFailed(SubscriptionState.java:618)
>     at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:709)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>     at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:574)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:388)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:294)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
>     at org.apache.kafka.streams.processor.internals.StreamThread.maybeUpdateStandbyTasks(StreamThread.java:1126)
>     at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:923)
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:805)
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:774)
> ```
>
> I checked the state store disk usage; it is less than 40% of the total
> disk space available. Restarting the application solves the problem for a
> short time, but the error soon pops up again randomly on other instances.
> I tried changing the retries and retry.backoff.ms configuration, but that
> did not help at all:
> ```
> retries = 2147483647
> retry.backoff.ms
> ```
> After googling for some time, I found that a similar bug was reported to the
> Kafka team in the past, and I noticed that my stack trace exactly matches
> the stack trace in that report.
> Here is the link to that bug, which was reported about a year ago:
> https://issues.apache.org/jira/browse/KAFKA-7181
>
> Now I am wondering whether there is a workaround for this bug through
> configuration changes, or whether there is something wrong with the way I
> set up the application. The following is the configuration I have for my
> stream application.
>
> ```
> consumer.session.timeout.ms=30000
> metric.reporters=org.apache.kafka.common.metrics.JmxReporter
> replication.factor=3
> metadata.max.age.ms=30000
> max.partition.fetch.bytes=2000000
> producer.retries=2147483647
> bootstrap.servers= <bootstrap server list goes here>
> metrics.recording.level=DEBUG
> producer.retry.backoff.ms=60000
> consumer.auto.offset.reset=latest
> application.server=0.0.0.0:6063
> num.standby.replicas=1
> max.poll.records=2
> group.initial.rebalance.delay.ms=30000
> state.dir= <state dir path goes here>
> heartbeat.interval.ms=10000
> max.poll.interval.ms=300000
> num.stream.threads=10
> application.id= <application id goes here>
> ```
> Note: The original bug reported a year ago was concluded to be related to
> https://issues.apache.org/jira/browse/KAFKA-7657 and was reported as solved
> in version 2.2.0, but I am using the latest version, 2.3.0.
> I appreciate your help concerning this bug.
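
A side note on the timing settings in the configuration quoted above: the Kafka consumer documentation recommends keeping heartbeat.interval.ms at no more than one third of session.timeout.ms. The following is a minimal sketch in plain Java (no Kafka dependency) that checks this relationship for the posted values. The class and method names (ConsumerTimingCheck, effectiveMs, heartbeatWithinThird) are hypothetical, and the "consumer." prefix fallback is a simplified assumption about how prefixed settings take precedence over unprefixed ones in Kafka Streams. It is only a sanity check on the configuration and does not diagnose the IllegalStateException itself.

```java
import java.util.Properties;

public class ConsumerTimingCheck {

    // Hypothetical helper: prefer the "consumer."-prefixed key and fall back
    // to the unprefixed key (a simplification of the Kafka Streams behavior).
    public static long effectiveMs(Properties props, String key) {
        String value = props.getProperty("consumer." + key, props.getProperty(key));
        if (value == null) {
            throw new IllegalArgumentException("missing setting: " + key);
        }
        return Long.parseLong(value.trim());
    }

    // True when heartbeat.interval.ms is at most one third of session.timeout.ms.
    public static boolean heartbeatWithinThird(Properties props) {
        long sessionMs = effectiveMs(props, "session.timeout.ms");
        long heartbeatMs = effectiveMs(props, "heartbeat.interval.ms");
        return heartbeatMs * 3 <= sessionMs;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Values copied from the configuration listed in the report.
        props.setProperty("consumer.session.timeout.ms", "30000");
        props.setProperty("heartbeat.interval.ms", "10000");
        System.out.println("heartbeat within 1/3 of session timeout: "
                + heartbeatWithinThird(props));
    }
}
```

With the posted values (30000 ms session timeout, 10000 ms heartbeat interval) the check passes exactly at the boundary, so these two settings are consistent with that recommendation.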