[ 
https://issues.apache.org/jira/browse/KAFKA-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987450#comment-16987450
 ] 

ASF GitHub Bot commented on KAFKA-9073:
---------------------------------------

guozhangwang commented on pull request #7630: KAFKA-9073: check assignment in 
requestFailed to avoid NPE
URL: https://github.com/apache/kafka/pull/7630
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Kafka Streams State stuck in rebalancing after one of the StreamThread 
> encounters java.lang.IllegalStateException: No current assignment for 
> partition
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9073
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9073
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.3.0
>            Reporter: amuthan Ganeshan
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: KAFKA-9073.log
>
>
> I have a Kafka stream application that stores the incoming messages into a 
> state store, and later during the punctuation period, we store them into a 
> big data persistent store after processing the messages.
> The application consumes from 120 partitions distributed across 40 instances. 
> The application has been running fine without any problem for months, but all 
> of a sudden some of the instances failed because of a stream thread exception 
> saying  
> ```java.lang.IllegalStateException: No current assignment for partition 
> <app_name>-<store_name>-changelog-98```
>  
> And other instances are stuck in the REBALANCING state, and never comes out 
> of it. Here is the full stack trace, I just masked the application-specific 
> app name and store name in the stack trace due to NDA.
>  
> ```
> 2019-10-21 13:27:13,481 ERROR 
> [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] 
> [org.apache.kafka.streams.processor.internals.StreamThread] [] stream-thread 
> [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] 
> Encountered the following error during processing:
> java.lang.IllegalStateException: No current assignment for partition 
> application.id-store_name-changelog-98
>  at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:319)
>  at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.requestFailed(SubscriptionState.java:618)
>  at 
> org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:709)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
>  at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:574)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:388)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:294)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.maybeUpdateStandbyTasks(StreamThread.java:1126)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:923)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:805)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:774)
> ```
>  
> Now I checked the state sore disk usage; it is less than 40% of the total 
> disk space available. Restarting the application solves the problem for a 
> short amount of time, but the error popping up randomly on some other 
> instances quickly. I tried to change the retry and retry.backoff.ms 
> configuration but not helpful at all
> ```
> retries = 2147483647
> retry.backoff.ms
> ```
> After googling for some time I found there was a similar bug reported to the 
> Kafka team in the past, and also notice my stack trace is exactly matching 
> with the stack trace of the reported bug.
> Here is the link for the bug reported on a comparable basis a year ago.
> https://issues.apache.org/jira/browse/KAFKA-7181
>  
> Now I am wondering is there a workaround for this bug though configuration 
> changes, or is there something wrong the way I set up the application, the 
> following are the configuration I have for my stream application.
>  
> ```
> consumer.session.timeout.ms=30000
>  metric.reporters=org.apache.kafka.common.metrics.JmxReporter
>  replication.factor=3
>  metadata.max.age.ms=30000
>  max.partition.fetch.bytes=2000000
>  producer.retries=2147483647
>  bootstrap.servers= <bootstrap server list goes here>
>  metrics.recording.level=DEBUG
>  producer.retry.backoff.ms=60000
>  consumer.auto.offset.reset=latest
>  application.server=0.0.0.0:6063
>  num.standby.replicas=1
>  max.poll.records=2
>  group.initial.rebalance.delay.ms=30000
>  state.dir= <state dir path goes here>
>  heartbeat.interval.ms=10000
>  max.poll.interval.ms=300000
>  num.stream.threads=10
>  application.id= <application id goes here>
> ```
> Note: The original bug reported a year back got a conclusion that it is 
> related to https://issues.apache.org/jira/browse/KAFKA-7657 and reported 
> solved in version 2.2.0, but I am using the latest 2.3.0 version.
> I appreciate your help concerning this bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to