[ https://issues.apache.org/jira/browse/KAFKA-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402337#comment-16402337 ]
Jason Gustafson commented on KAFKA-6671:
----------------------------------------

One cause of slow coordinator failover is an oversized __consumer_offsets topic. Can you verify the size of the __consumer_offsets partitions and whether the log cleaner is enabled?

> Consumer group coordinator releases group before new coordinator is ready.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-6671
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6671
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>            Reporter: Rob Gevers
>            Priority: Major
>
> We regularly have an issue with our Kafka deploys which causes consumers to
> be unable to consume for an extended period of time (up to an hour) after the
> deploy finishes. The issue appears to be a side-effect of the way consumer
> group coordination is managed between nodes. A sample timeline of a deploy
> looks like the following:
> We initiate a clean shutdown of a node (which we will call kafka-2). We see
> these traces:
> {noformat}
> [2018-02-20 09:13:46,935] INFO [GroupCoordinator 1]: Loading group metadata for ConsumerGroup with generation 3041 (kafka.coordinator.GroupCoordinator)
> {noformat}
> {noformat}
> [2018-02-20 09:13:47,788] INFO [GroupCoordinator 2]: Unloading group metadata for ConsumerGroup with generation 3041
> {noformat}
> At this point kafka-2 is shut down and restarted successfully. Consumers
> continue to function fine. Once kafka-2 is back online we see this trace from
> kafka-1:
> {noformat}
> [2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group metadata for ConsumerGroup with generation 3041
> {noformat}
> At this point the consumers go into a loop of "Discovered coordinator
> Kafka-2" and "Marking the coordinator Kafka-2 dead". This preempts the
> heartbeat timer and we even see the heartbeat rate metrics drop to 0.
> This continues until kafka-2 has finished processing offset data and finally
> traces:
> {noformat}
> [2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group metadata for ConsumerGroup with generation 3041 (kafka.coordinator.GroupCoordinator)
> {noformat}
> What seems like a bug to me is that kafka-1 is unloading the consumer group
> long before kafka-2 is ready to load it. This seems to leave the group in an
> unusable state, with offset commits failing because they are trying to commit
> to kafka-2, but kafka-2 keeps responding that it isn't the group coordinator.
> There is no coordinator for an hour.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)