[ https://issues.apache.org/jira/browse/KAFKA-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mjuarez updated KAFKA-4674:
---------------------------
    Attachment: kafkabroker.20170221.log.zip

> Frequent ISR shrinking and expanding and disconnects among brokers
> ------------------------------------------------------------------
>
>                 Key: KAFKA-4674
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4674
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.0.1
>         Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
>                      JDK: 1.8.0_45
>            Reporter: Kaiming Wan
>         Attachments: controller.log.rar, kafkabroker.20170221.log.zip,
>                      server.log.2017-01-11-14, zookeeper.out.2017-01-11.log
>
>
> We run a Kafka cluster with 3 brokers in our production environment. It
> worked well for several months. Recently, we started receiving
> UnderReplicatedPartitions>0 warning mails. When we checked the logs, we
> found that partitions were constantly going through ISR shrinking and
> expanding, and disconnection exceptions could be found in the
> controller's log.
> We also found some abnormal output in ZooKeeper's log, pointing to a
> consumer (using the old API, which depends on ZooKeeper) that had stopped
> working and had accumulated a large lag.
> Actually, this is not the first time we have encountered this problem.
> The first time, we saw the same symptoms and the same log output. We
> solved it by deleting the consumer's node info in ZooKeeper, after which
> everything went well.
> However, this time, after we deleted the consumer that already had a
> large lag, the frequent ISR shrinking and expanding did not stop for a
> very long time (several hours). Although the issue did not affect our
> consumers and producers, we think it will make our cluster unstable. In
> the end, we solved the problem by restarting the controller broker.
> Now I wonder what causes this problem. I checked the source code and
> only know that a poll timeout will cause disconnection and ISR shrinking.
> Is the issue related to ZooKeeper, in that it cannot handle too many
> metadata modifications and so makes the replication fetch thread take
> more time?
> I have uploaded the log files as attachments.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)