Jens Rantil created KAFKA-2985:
----------------------------------

             Summary: Consumer group stuck in rebalancing state
                 Key: KAFKA-2985
                 URL: https://issues.apache.org/jira/browse/KAFKA-2985
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.9.0.0
         Environment: Kafka 0.9.0.0.
Kafka Java consumer 0.9.0.0

2 Java producers.
3 Java consumers using the new consumer API.
2 kafka brokers.
            Reporter: Jens Rantil
            Assignee: Neha Narkhede


We've doing some load testing on Kafka. _After_ the load test when our 
consumers and have two times now seen Kafka become stuck in consumer group 
rebalancing. This is after all our consumers are done consuming and essentially 
polling periodically without getting any records.

The brokers list the consumer group (named "default"), but I can't query the 
offsets:

{noformat}
jrantil@queue-0:/srv/kafka/kafka$ ./bin/kafka-consumer-groups.sh --new-consumer 
--bootstrap-server localhost:9092 --list
default
jrantil@queue-0:/srv/kafka/kafka$ ./bin/kafka-consumer-groups.sh --new-consumer 
--bootstrap-server localhost:9092 --describe --group default|sort
Consumer group `default` does not exist or is rebalancing.
{noformat}

Retrying to query the offsets for 15 minutes or so still said it was 
rebalancing. After restarting our first broker, the group immediately started 
rebalancing. That broker was logging this before restart:

{noformat}
[2015-12-12 13:09:48,517] INFO [Group Metadata Manager on Broker 0]: Removed 0 
expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2015-12-12 13:10:16,139] INFO [GroupCoordinator 0]: Stabilized group default 
generation 16 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:10:16,141] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 16 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:10:16,575] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 16 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:11:15,141] INFO [GroupCoordinator 0]: Stabilized group default 
generation 17 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:11:15,143] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 17 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:11:15,314] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 17 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:12:14,144] INFO [GroupCoordinator 0]: Stabilized group default 
generation 18 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:12:14,145] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 18 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:12:14,340] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 18 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:13:13,146] INFO [GroupCoordinator 0]: Stabilized group default 
generation 19 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:13:13,148] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 19 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:13:13,238] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 19 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:14:12,148] INFO [GroupCoordinator 0]: Stabilized group default 
generation 20 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:14:12,149] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 20 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:14:12,360] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 20 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:15:11,150] INFO [GroupCoordinator 0]: Stabilized group default 
generation 21 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:15:11,152] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 21 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:15:11,217] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 21 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:16:10,152] INFO [GroupCoordinator 0]: Stabilized group default 
generation 22 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:16:10,154] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 22 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:16:10,339] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 22 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:17:09,155] INFO [GroupCoordinator 0]: Stabilized group default 
generation 23 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:17:09,157] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 23 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:17:09,262] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 23 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:18:08,157] INFO [GroupCoordinator 0]: Stabilized group default 
generation 24 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:18:08,159] INFO [GroupCoordinator 0]: Assignment received from 
leader for group default for generation 24 (kafka.coordinator.GroupCoordinator)
[2015-12-12 13:18:08,333] INFO [GroupCoordinator 0]: Preparing to restabilize 
group default with old generation 24 (kafka.coordinator.GroupCoordinator)
{noformat}

Our consumers were logging:
{noformat}
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator Marking the 
coordinator 2147483647 dead.
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Error 
UNKNOWN_MEMBER_ID occurred while committing offsets for group default
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Auto offset 
commit failed: Commit cannot be completed due to group rebalance
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator Marking the 
coordinator 2147483647 dead.
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Error 
UNKNOWN_MEMBER_ID occurred while committing offsets for group default
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Auto offset 
commit failed: Commit cannot be completed due to group rebalance
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Error 
UNKNOWN_MEMBER_ID occurred while committing offsets for group default
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Auto offset 
commit failed:
Dec 12 13:09:17 X.X.X.110 system[27782]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator Attempt to join 
group default failed due to unknown member id, resetting and retrying.
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Error 
UNKNOWN_MEMBER_ID occurred while committing offsets for group default
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator Auto offset 
commit failed:
Dec 12 13:09:17 X.X.X.144 system[9915]: [KafkaTaskExecutorConsumer] 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator Attempt to join 
group default failed due to unknown member id, resetting and retrying.
{noformat}

I understand that the broker might start rebalancing if my consumers hasn't 
reported heartbeat in session timeout. This might well have happened during my 
load test. However, the issue here is that the rebalancing doesn't 
stabilize/finish after the load test is done.

Let me know if I can be of any assistance to track this down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to