[ 
https://issues.apache.org/jira/browse/KAFKA-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402337#comment-16402337
 ] 

Jason Gustafson commented on KAFKA-6671:
----------------------------------------

One cause of slow coordinator failover is an oversized __consumer_offsets 
topic. Can you verify the size of the __consumer_offsets partitions and whether 
the log cleaner is enabled?
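
For anyone wanting to run that check on a broker host, here is a rough sketch, 
not an official tool. The helper names are made up for illustration. It assumes 
the usual on-disk layout of one `__consumer_offsets-<N>` directory per partition 
under the broker's log directory, the default of 50 offsets partitions 
(offsets.topic.num.partitions), and that a group maps to a partition by the 
absolute value of the Java hash of its group id modulo the partition count. The 
fallback of "enabled" when log.cleaner.enable is not set is also an assumption 
and should be verified against the broker version in use.

```python
import os
import re

_OFFSETS_DIR = re.compile(r"^__consumer_offsets-(\d+)$")


def java_string_hash(s):
    """Replicate Java's String.hashCode() as a signed 32-bit integer."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h & 0x80000000 else h


def offsets_partition_for(group_id, num_partitions=50):
    """Guess which __consumer_offsets partition holds a group (assumed mapping)."""
    h = java_string_hash(group_id)
    # Assumption: absolute value, with Integer.MIN_VALUE treated as 0.
    positive = 0 if h == -(1 << 31) else abs(h)
    return positive % num_partitions


def offsets_partition_sizes(log_dir):
    """Return {partition: bytes on disk} for each __consumer_offsets partition."""
    sizes = {}
    for entry in os.listdir(log_dir):
        match = _OFFSETS_DIR.match(entry)
        path = os.path.join(log_dir, entry)
        if match and os.path.isdir(path):
            sizes[int(match.group(1))] = sum(
                os.path.getsize(os.path.join(path, name))
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
    return sizes


def log_cleaner_enabled(properties_path, default=True):
    """Read log.cleaner.enable from a server.properties style file."""
    enabled = default  # assumed default when the key is absent
    with open(properties_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == "log.cleaner.enable":
                enabled = value.strip().lower() == "true"  # last entry wins
    return enabled
```

A partition that is many gigabytes while its neighbours are small, combined with 
a disabled cleaner, would be consistent with the slow load described in this 
ticket. The paths to pass in (the log directory and server.properties) depend on 
the installation.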

> Consumer group coordinator releases group before new coordinator is ready.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-6671
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6671
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>            Reporter: Rob Gevers
>            Priority: Major
>
> We regularly have an issue with our Kafka deploys which causes consumers to 
> be unable to consume for an extended period of time (up to an hour) after the 
> deploy finishes. The issue appears to be a side-effect of the way consumer 
> group coordination is managed between nodes. A sample timeline of a deploy 
> looks like the following:
> We initiate a clean shutdown of a node (which we will call kafka-2). We see 
> these traces:
> {noformat}
>  [2018-02-20 09:13:46,935] INFO [GroupCoordinator 1]: Loading group metadata 
> for ConsumerGroup with generation 3041 
> (kafka.coordinator.GroupCoordinator){noformat}
> {noformat}
>  [2018-02-20 09:13:47,788] INFO [GroupCoordinator 2]: Unloading group 
> metadata for ConsumerGroup with generation 3041{noformat}
> At this point kafka-2 is shut down and restarted successfully. Consumers 
> continue to function fine. Once kafka-2 is back online we see this trace from 
> kafka-1:
> {noformat}
>  [2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group 
> metadata for ConsumerGroup with generation 3041{noformat}
> At this point the consumers go into a loop of "Discovered coordinator 
> Kafka-2" and "Marking the coordinator Kafka-2 dead". This preempts the 
> heartbeat timer and we even see the heartbeat rate metrics drop to 0. This 
> continues until kafka-2 has finished processing offset data and finally logs:
> {noformat}
>  [2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group metadata 
> for ConsumerGroup with generation 3041 
> (kafka.coordinator.GroupCoordinator){noformat}
> What seems like a bug to me is that kafka-1 is unloading the consumer group 
> long before kafka-2 is ready to load it. This seems to leave the group in an 
> unusable state, with offset commits failing because they are trying to commit 
> to kafka-2, but kafka-2 keeps responding that it isn't the group coordinator. 
> There is no coordinator for an hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
