[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227519#comment-15227519 ]
Robert Christ commented on KAFKA-3042:
--------------------------------------

I work with James and we have seen this problem repeatedly. We have been able to reproduce the problem somewhat reliably, and the pattern seems to be:

1) hard kill the controller (say broker 1)
2) after the session timeout, the zookeeper session expires for broker 1
3) another node (say broker 2) takes ownership of the /controller node
4) the zookeeper session for broker 2 expires even though broker 2 continues to function (see below)
5) another node (say broker 3) takes ownership of the /controller node
6) at some point in the future, possibly after broker 3 finishes taking controllership or broker 1 resumes from the hard stop, broker 2 will spew unending streams of the "Cached zkVersion..." message
7) restarting broker 2 will cause the zkVersion problem to go away

While the zkVersion message is appearing, the ISR lists do not get updated and we have under-replicated partitions.

So step 4 is the mystery. I believe it happens because we have some form of network/disk/cpu contention that causes the ping from broker 2 not to reach, or not to be acknowledged by, zk within the session timeout. We are actively working to figure that out, but I believe it is triggering some race condition or bug where the active controller loses control of the /controller node and another node takes it.

I have logs (oh so many logs) from when this was occurring and can reproduce it fairly easily.
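To make the symptom in step 6 concrete, here is a toy model of a version-checked write. It assumes the ISR update is a conditional write keyed on a version the broker cached earlier, which is what the "Cached zkVersion ... not equal to that in zookeeper" message suggests. `VersionedNode`, `conditionalUpdate` and `failedAttempts` are hypothetical names made up for this sketch; none of this is Kafka's code or the ZooKeeper client API.

```java
import java.util.List;

public class ZkVersionSketch {

    /** In-memory stand-in for a versioned znode; hypothetical, not the ZooKeeper client API. */
    static final class VersionedNode {
        private List<Integer> isr;
        private int version;

        VersionedNode(List<Integer> isr, int version) {
            this.isr = isr;
            this.version = version;
        }

        /** Writes only if the caller's expected version matches; bumps the version on success. */
        synchronized boolean conditionalUpdate(List<Integer> newIsr, int expectedVersion) {
            if (expectedVersion != version) {
                return false;
            }
            isr = newIsr;
            version++;
            return true;
        }

        synchronized int version() { return version; }

        synchronized List<Integer> isr() { return isr; }
    }

    /** Retries with a fixed cached version and returns how many attempts failed. */
    static int failedAttempts(VersionedNode node, int cachedVersion, List<Integer> newIsr, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            if (node.conditionalUpdate(newIsr, cachedVersion)) {
                break;
            }
            failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        VersionedNode partitionState = new VersionedNode(List.of(1, 2, 3), 54);
        int broker2CachedVersion = partitionState.version(); // broker 2 caches 54

        // Another writer (for example a newly elected controller) rewrites the node: 54 -> 55.
        partitionState.conditionalUpdate(List.of(1, 2, 3), 54);

        // Broker 2 still holds 54, so every attempt is rejected.
        int failures = failedAttempts(partitionState, broker2CachedVersion, List.of(2), 5);
        System.out.println("failed attempts with stale version: " + failures);

        // Only a write that uses the current version goes through.
        boolean ok = partitionState.conditionalUpdate(List.of(2), partitionState.version());
        System.out.println("update with current version succeeded: " + ok);
    }
}
```

The sketch only shows why retrying with the same cached version can never succeed once another writer has bumped it. It does not claim that re-reading the version is the right fix in Kafka; whether broker 2 should still be writing at all is exactly the open question here.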

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
>                      centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01,
>                      state-change.log
>
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact
> it is a follower.
> So after several failed tries, it needs to find out who the leader is.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)