I'm not quite sure how we get into this state, but we've seen this a few times now. Basically, one of our brokers (broker 1 in this case) gets into a state where ISR updates fail forever:
[2013-10-16 06:19:12,448] ERROR Conditional update of path /brokers/topics/search-gateway-wal/partitions/5/state with data { "controller_epoch":62, "isr":[ 1, 3 ], "leader":1, "leader_epoch":61, "version":1 } and expected version 125 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/search-gateway-wal/partitions/5/state (kafka.utils.ZkUtils$)
[2013-10-16 06:19:12,448] INFO Partition [search-gateway-wal,5] on broker 1: Cached zkVersion [125] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

This repeats over and over again for a subset of the partitions.

Looking at the other brokers in the cluster, it seems they think broker 1 is also the controller, and that the partition in this example has the following state:

(search-gateway-wal,5) -> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:61,ControllerEpoch:62),ReplicationFactor:3),AllReplicas:1,3,4)

Looking at the code in Partition, it seems that the zkVersion is only ever updated in makeFollower/makeLeader.

Any ideas on how we may have gotten into this state?

/Sam
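
PS: For anyone who wants to see the failure mode in isolation, the update that keeps failing boils down to a ZooKeeper conditional setData, i.e. a compare-and-set on the znode version. Here's a minimal sketch using the plain ZooKeeper Java client (the connect string, class name and literal values are placeholders, not our actual code) that hits the same BadVersion path once the cached version is stale:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and timeout, not our production settings.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        String path = "/brokers/topics/search-gateway-wal/partitions/5/state";
        byte[] newState = "{ \"leader\":1, \"isr\":[1,3] }".getBytes(); // stand-in payload
        int cachedZkVersion = 125; // the version the broker has cached

        try {
            // setData only succeeds if the znode's current version equals the
            // expected version we pass in; otherwise ZooKeeper rejects the write.
            Stat stat = zk.setData(path, newState, cachedZkVersion);
            System.out.println("ISR updated, new zkVersion = " + stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // This is what the ERROR line above shows: the znode's version has
            // moved on (e.g. the controller rewrote the state), so the cached
            // zkVersion is stale and every retry with it fails the same way.
            Stat current = zk.exists(path, false);
            System.out.println("Cached zkVersion " + cachedZkVersion
                    + " != actual " + current.getVersion() + ", skip updating ISR");
        } finally {
            zk.close();
        }
    }
}

Since the cached zkVersion only seems to be refreshed in makeLeader/makeFollower, once it goes stale there is nothing in the retry path itself that can recover, which matches the endless repetition we see in the logs.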