[
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238651#comment-15238651
]
Jun Rao commented on KAFKA-3042:
--------------------------------
The issue seems to be the following. In 0.9.0, we changed the logic a bit in
ReplicaManager.makeFollowers() to ensure that the new leader is in the
liveBrokers of metadataCache. However, during a controller failover, the new
controller first sends leaderAndIsr requests, followed by an
UpdateMetadataRequest. So, when a broker receives a leaderAndIsr request, the
liveBrokers in its metadataCache may be stale and not include the new leader,
which causes the becoming-follower logic to error out. Indeed, in broker 1's
state-change log, the last UpdateMetadataRequest before the error in becoming
follower came from controller 1.
{code}
[2016-04-09 00:40:52,929] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:330,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:2,1,4) for partition [tec1.usqe1.frontend.syncPing,1] in response to UpdateMetadata request sent by controller 1 epoch 414 with correlation id 877 (state.change.logger)
{code}
In controller 1's log, the last update to the live broker list is the
following, and it does not include broker 4.
{code}
[2016-04-09 00:39:33,005] INFO [BrokerChangeListener on Controller 1]: Newly added brokers: , deleted brokers: 2, all live brokers: 1,3,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}
To fix this, we should probably send an UpdateMetadataRequest before any
leaderAndIsrRequest during controller failover.
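
The ordering problem can be illustrated with a small toy model. This is only a sketch, not Kafka code: the Broker class, its handler names, and the failover() helper are hypothetical, and the post-failover live broker list {1, 3, 4, 5} and the choice of broker 4 as the new leader are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical stand-in for a broker and its metadata cache."""
    broker_id: int
    live_brokers: set = field(default_factory=set)
    leaders: dict = field(default_factory=dict)  # partition -> leader id

    def handle_update_metadata(self, live_brokers):
        # Refresh the cached live broker list.
        self.live_brokers = set(live_brokers)

    def handle_leader_and_isr(self, partition, leader):
        # Mimics the 0.9.0 check described above: refuse to follow a
        # leader that is missing from the cached live broker list.
        if leader not in self.live_brokers:
            return False
        self.leaders[partition] = leader
        return True


def failover(broker, live_brokers, partition, leader, metadata_first):
    """Replay the two requests a new controller sends, in either order."""
    if metadata_first:
        broker.handle_update_metadata(live_brokers)
        return broker.handle_leader_and_isr(partition, leader)
    ok = broker.handle_leader_and_isr(partition, leader)
    broker.handle_update_metadata(live_brokers)
    return ok


partition = ("tec1.usqe1.frontend.syncPing", 1)

# Stale cache {1, 3, 5}, as in controller 1's last broker change.
stale = Broker(1, {1, 3, 5})
print(failover(stale, {1, 3, 4, 5}, partition, 4, metadata_first=False))

# Same starting state, but the metadata update arrives first.
fresh = Broker(1, {1, 3, 5})
print(failover(fresh, {1, 3, 4, 5}, partition, 4, metadata_first=True))
```

With the leaderAndIsr request first, the toy broker rejects the new leader because its cache is stale. With the metadata update first, the same request succeeds.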
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
> Key: KAFKA-3042
> URL: https://issues.apache.org/jira/browse/KAFKA-3042
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Environment: jdk 1.7
> centos 6.4
> Reporter: Jiahongchao
> Attachments: controller.log, server.log.2016-03-23-01,
> state-change.log
>
>
> Sometimes a broker repeatedly logs
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact
> it is a follower.
> So after several failed tries, it needs to find out who the leader is.
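
The behaviour requested in the description could look roughly like the sketch below. It is not Kafka or ZooKeeper code: VersionedStore is a hypothetical in-memory stand-in for a versioned znode, and update_isr(), max_attempts, and the "refresh-leadership" return value are illustrative names only.

```python
class VersionedStore:
    """Hypothetical in-memory stand-in for a znode with a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def conditional_set(self, value, expected_version):
        # Write only if the caller's cached version matches the stored one.
        if expected_version != self.version:
            return False
        self.value = value
        self.version += 1
        return True


def update_isr(store, new_isr, cached_version, max_attempts=3):
    """Attempt a conditional ISR update, giving up after max_attempts.

    Returns "updated" on success, or "refresh-leadership" to tell the
    caller to re-read who the leader is instead of retrying forever.
    """
    for _ in range(max_attempts):
        if store.conditional_set(new_isr, cached_version):
            return "updated"
    return "refresh-leadership"


store = VersionedStore([1, 2, 4])
store.conditional_set([1], 0)  # another writer bumps the version to 1

# A broker still holding version 0 gives up after a bounded number of tries.
print(update_isr(store, [1, 2], cached_version=0))
# A broker holding the current version succeeds.
print(update_isr(store, [1, 2], cached_version=1))
```

Retrying with the same stale version can never succeed, so the sketch stops after a fixed number of attempts and signals that leadership should be re-checked.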