[jira] [Commented] (KAFKA-3971) Consumers drop from coordinator and cannot reconnet

Jason Gustafson (JIRA) Fri, 05 Aug 2016 16:42:35 -0700

    [ https://issues.apache.org/jira/browse/KAFKA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410285#comment-15410285 ]


Jason Gustafson commented on KAFKA-3971:
----------------------------------------

[~wonlay] All consumer instances sharing the same group ID are part of the same 
consumer group. The point of a consumer group is to balance the consumption 
load. For example, if you have a topic with 10 partitions and you have 10 
consumers in the group, then each instance can be assigned one partition. 
However, if the consumers in the group each subscribe to a different topic, 
then they will only be assigned the respective partitions from the topic they 
subscribed to. In that case, you may as well use a separate group ID because 
there is no load balancing that can be done. Going further, if only one 
consumer in the group is subscribing to each topic (as appears to be the case 
for you), then there is no reason to use a consumer group at all. You can 
manually assign all the partitions from that topic and avoid the overhead of 
the rebalance protocol. Instead of calling {{consumer.subscribe()}} as in the 
snippet you provided above, you would do something like this:

{code}
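// Look up every partition of the topic and assign them all to this consumer.
List<PartitionInfo> allPartitionInfo = consumer.partitionsFor(topic);
// A List is accepted by assign() in 0.9 (List parameter) and in later clients
// (Collection parameter).
List<TopicPartition> topicPartitions = new ArrayList<>();
for (PartitionInfo partitionInfo : allPartitionInfo) {
  topicPartitions.add(new TopicPartition(partitionInfo.topic(), partitionInfo.partition()));
}
// Manual assignment does not use the group rebalance protocol.
consumer.assign(topicPartitions);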
{code}

You can then use the consumer exactly as before. Of course, all of this 
assumes that you must have a separate consumer instance for every topic. A 
more efficient pattern is to have fewer consumers, each of which subscribes to 
a larger number of topics. For example, instead of having 800 consumers each 
subscribing to one topic, I'd try to get away with maybe 4 consumers each 
subscribing to 200 topics. Perhaps one or two consumers per available CPU would 
be a reasonable upper bound? Any more than that and your throughput probably 
just gets worse.
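
As a rough sketch of the "fewer consumers, more topics" pattern, the helper 
below splits a list of topic names round-robin into a fixed number of shards, 
one shard per consumer. The class name TopicSharding, the method shard, and the 
"topic-N" names are hypothetical and exist only for illustration; the code uses 
just the standard library so it can run without a broker. The consumer count is 
taken from the CPU count purely as an assumption based on the rule of thumb 
above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Kafka client API.
class TopicSharding {

    // Distribute the topics round-robin across consumerCount shards.
    static List<List<String>> shard(List<String> topics, int consumerCount) {
        if (consumerCount <= 0)
            throw new IllegalArgumentException("consumerCount must be positive");
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < consumerCount; i++)
            shards.add(new ArrayList<>());
        for (int i = 0; i < topics.size(); i++)
            shards.get(i % consumerCount).add(topics.get(i));
        return shards;
    }

    public static void main(String[] args) {
        List<String> topics = new ArrayList<>();
        for (int i = 0; i < 800; i++)
            topics.add("topic-" + i); // placeholder topic names

        // Assumption: roughly one consumer per available CPU.
        int consumerCount = Math.max(1, Runtime.getRuntime().availableProcessors());
        List<List<String>> shards = shard(topics, consumerCount);
        for (int i = 0; i < shards.size(); i++)
            System.out.println("consumer " + i + " -> " + shards.get(i).size() + " topics");
    }
}
```

Each shard would then be handed to its own consumer instance, for example with 
{{consumer.subscribe(shards.get(i))}}, typically with each consumer polled from 
its own thread.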

All of that aside, there may still be a bug here which becomes more likely as 
the size of the group increases. We have not actually done a lot of testing 
with consumer groups this large, so I'll do some investigation and see if I can 
reproduce the problem.

> Consumers drop from coordinator and cannot reconnet
> ---------------------------------------------------
>
>                 Key: KAFKA-3971
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3971
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.1
>         Environment: version 0.9.0.1
>            Reporter: Lei Wang
>         Attachments: KAFKA-3971.txt
>
>
> From time to time, we create new topics, and all consumers pick up 
> those new topics. When they start consuming from these new topics, we often 
> see that some random consumers cannot connect to the coordinator. The log is 
> flooded with the following message, tens of thousands of times every second:
> {noformat}
> 16/07/18 18:18:36.003 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.005 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> {noformat}
> The servers seem to be working fine, and other consumers are also happy.
> From the log, it looks like the consumer is retrying multiple times every 
> millisecond, but every attempt fails.
> The same process is consuming from many topics; some of them are still 
> working well, but those random topics fail.



