[ 
https://issues.apache.org/jira/browse/KAFKA-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mjuarez updated KAFKA-4674:
---------------------------
    Attachment: kafkabroker.20170221.log.zip

> Frequent ISR shrinking and expanding and disconnects among brokers
> ------------------------------------------------------------------
>
>                 Key: KAFKA-4674
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4674
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.0.1
>         Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
> JDK: 1.8.0_45
>            Reporter: Kaiming Wan
>         Attachments: controller.log.rar, kafkabroker.20170221.log.zip, 
> server.log.2017-01-11-14, zookeeper.out.2017-01-11.log
>
>
>     We run a Kafka cluster with 3 brokers in our production environment. It 
> worked well for several months. Recently, we started receiving 
> UnderReplicatedPartitions>0 warning mails. When we checked the logs, we found 
> that the partitions kept going through ISR shrinking and expanding, and 
> disconnection exceptions appeared in the controller's log.
>     We also found some abnormal output in ZooKeeper's log pointing to a 
> consumer (using the old API, which depends on ZooKeeper) that had stopped 
> working and had accumulated a large lag.
>     This is not the first time we have encountered this problem. The first 
> time, we saw the same behavior and the same log output. We solved it by 
> deleting the consumer's node info in ZooKeeper, after which everything went 
> well.
>     However, this time, after we deleted the consumer that already had a 
> large lag, the frequent ISR shrinking and expanding did not stop for a very 
> long time (several hours). Although the issue did not affect our consumers 
> and producers, we think it makes our cluster unstable. In the end, we solved 
> the problem by restarting the controller broker.
>     Now I wonder what causes this problem. I checked the source code and 
> only learned that a poll timeout will cause disconnection and ISR shrinking. 
> Is the issue related to ZooKeeper, because it cannot handle too many metadata 
> modifications and so makes the replication fetch thread take more time?
> I have uploaded the log files as attachments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
