I'm not quite sure how we get into this state, but we've seen it a few times
now. Basically, one of our brokers (broker 1 in this case) ends up in a state
where ISR updates fail forever:

[2013-10-16 06:19:12,448] ERROR Conditional update of path /brokers/topics/search-gateway-wal/partitions/5/state with data { "controller_epoch":62, "isr":[ 1, 3 ], "leader":1, "leader_epoch":61, "version":1 } and expected version 125 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/search-gateway-wal/partitions/5/state (kafka.utils.ZkUtils$)
[2013-10-16 06:19:12,448] INFO Partition [search-gateway-wal,5] on broker 1: Cached zkVersion [125] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
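
For reference, the update that is failing here is ZooKeeper's conditional
setData: the write only goes through if the znode is still at the version the
client passes in, and every successful write bumps that version. A minimal
sketch of that mechanism with the plain ZooKeeper Java client (the connect
string, timeout and class name are placeholders, and this is not Kafka's actual
ZkUtils code; the path and payload are just copied from the log above):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalIsrUpdateSketch {
    public static void main(String[] args) throws Exception {
        String path = "/brokers/topics/search-gateway-wal/partitions/5/state";
        byte[] newState =
            "{\"controller_epoch\":62,\"isr\":[1,3],\"leader\":1,\"leader_epoch\":61,\"version\":1}"
                .getBytes(StandardCharsets.UTF_8);
        int cachedZkVersion = 125;  // the version broker 1 has cached for this znode

        // Placeholder connect string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        try {
            // Conditional update: succeeds only if the znode is still at the expected version.
            zk.setData(path, newState, cachedZkVersion);
        } catch (KeeperException.BadVersionException e) {
            // Another writer has bumped the znode since version 125 was cached, so
            // ZooKeeper rejects the write -- this is the BadVersion error in the log.
            Stat stat = new Stat();
            zk.getData(path, false, stat);
            System.out.println("cached version 125, actual version " + stat.getVersion());
        } finally {
            zk.close();
        }
    }
}

So the BadVersion on every attempt just says that the znode has been written
since broker 1 last refreshed its cached version 125.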

This repeats over and over for a subset of the partitions. Looking at the
other brokers in the cluster, it seems they think broker 1 is also the
controller, and that the partition in this example has the following state:

(search-gateway-wal,5) -> (LeaderAndIsrInfo:(Leader:1, 
ISR:1,LeaderEpoch:61,ControllerEpoch:62),ReplicationFactor:3),AllReplicas:1,3,4)

Looking at the code in Partition, it seems that the zkVersion is only ever
updated in makeFollower/makeLeader.
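
If that's right, then once something else writes the state znode (bumping
ZooKeeper's version), the cached zkVersion stays stale outside of those two
paths, and every subsequent conditional update hits BadVersion and gets
skipped, which is exactly the loop in the log. Just to illustrate the
difference (this is a made-up helper, not Kafka's Partition code), a write
that refreshed its cached version on conflict would look something like:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical helper, not Kafka code: refresh the cached version and retry
// once when a conditional write is rejected.
class VersionRefreshSketch {
    static int writeWithRefresh(ZooKeeper zk, String path, byte[] data, int cachedVersion)
            throws KeeperException, InterruptedException {
        try {
            // Normal case: the cached version still matches ZooKeeper's.
            return zk.setData(path, data, cachedVersion).getVersion();
        } catch (KeeperException.BadVersionException e) {
            // Cache is stale: re-read the znode's current version and retry.
            Stat stat = new Stat();
            zk.getData(path, false, stat);
            return zk.setData(path, data, stat.getVersion()).getVersion();
        }
    }
}

A blind retry like that presumably isn't safe here (a bumped version likely
means the controller or a new leader rewrote the state), but it shows the gap:
if nothing outside makeFollower/makeLeader refreshes the cache, the skip just
repeats forever.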

Any ideas on how we may have gotten into this state?

/Sam
