[ https://issues.apache.org/jira/browse/KAFKA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410285#comment-15410285 ]
Jason Gustafson commented on KAFKA-3971:
----------------------------------------

[~wonlay] All consumer instances sharing the same group ID are part of the same consumer group. The point of a consumer group is to balance the consumption load. For example, if you have a topic with 10 partitions and you have 10 consumers in the group, then each instance can be assigned one partition. However, if the consumers in the group each subscribe to a different topic, then they will only be assigned the respective partitions from the topic they subscribed to. In that case, you may as well use a separate group ID because there is no load balancing that can be done.

Going further, if only one consumer in the group is subscribing to each topic (as appears to be the case for you), then there is no reason to use a consumer group at all. You can manually assign all the partitions from that topic and avoid the overhead of the rebalance protocol. Instead of calling {{consumer.subscribe()}} as in the snippet you provided above, you would do something like this:

{code}
List<PartitionInfo> allPartitionInfo = consumer.partitionsFor(topic);
Set<TopicPartition> topicPartitions = new HashSet<>();
for (PartitionInfo partitionInfo : allPartitionInfo)
    topicPartitions.add(new TopicPartition(partitionInfo.topic(), partitionInfo.partition()));
consumer.assign(topicPartitions);
{code}

You can then use the consumer exactly as before.

Of course, all of this is assuming that you must have a separate consumer instance for every topic. A more efficient pattern is to have fewer consumers, each of which subscribes to a larger number of topics. For example, instead of having 800 consumers subscribing to one topic, I'd try to get away with maybe 4 consumers each subscribing to 200 topics. Perhaps one or two consumers per available CPU would be a reasonable upper bound? Any more than that and your throughput probably just gets worse.
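To make the "fewer consumers, more topics" suggestion concrete, here is a minimal sketch of how the topic names might be divided among a small, fixed number of consumers. The class and method names ({{TopicGrouping}}, {{splitTopics}}) and the topic names are hypothetical and not part of the Kafka client; each resulting list is simply what one consumer instance would pass to {{consumer.subscribe()}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Kafka client API.
class TopicGrouping {

    // Deals the topics out round-robin, so list sizes differ by at most one.
    static List<List<String>> splitTopics(List<String> topics, int consumerCount) {
        if (consumerCount <= 0) {
            throw new IllegalArgumentException("consumerCount must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < consumerCount; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < topics.size(); i++) {
            groups.get(i % consumerCount).add(topics.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> topics = new ArrayList<>();
        for (int i = 0; i < 800; i++) {
            topics.add("topic-" + i); // made-up topic names
        }
        // 4 consumers is the figure from the example above, not a recommendation.
        List<List<String>> groups = splitTopics(topics, 4);
        for (List<String> group : groups) {
            // Each list is what a single consumer instance would subscribe to.
            System.out.println(group.size() + " topics, first: " + group.get(0));
        }
    }
}
```

Round-robin keeps the lists within one topic of each other in size. It does not account for differences in traffic between topics, so a real split might need to weight busy topics.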
All of that aside, there may still be a bug here which becomes more likely as the size of the group increases. We have not actually done a lot of testing with consumer groups this large, so I'll do some investigation and see if I can reproduce the problem.

> Consumers drop from coordinator and cannot reconnet
> ---------------------------------------------------
>
>                 Key: KAFKA-3971
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3971
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.1
>         Environment: version 0.9.0.1
>            Reporter: Lei Wang
>         Attachments: KAFKA-3971.txt
>
>
> From time to time, we create new topics, and all consumers pick up those new topics. When they start to consume from these new topics, we often see that some random consumers cannot connect to the coordinator. The log is flooded with the following message, tens of thousands of times every second:
> {noformat}
> 16/07/18 18:18:36.003 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.004 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> 16/07/18 18:18:36.005 INFO (AbstractCoordinator.java:529): Marking the coordinator 2147483645 dead.
> {noformat}
> The servers seem to be working fine, and other consumers are also happy.
> From the log, it looks like the consumer is retrying multiple times every millisecond, but every attempt fails.
> The same process is consuming from many topics. Some of them are still working well, but those random topics fail.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)