[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879700#comment-15879700
 ] 

Jun Rao commented on KAFKA-2729:
--------------------------------

Sorry to hear about the impact to production. Grant mentioned ZK session 
expiration, which is indeed a potential cause of this issue. A related issue 
has been reported in KAFKA-3083. The problem is that when the controller's ZK 
session expires and it loses controllership, this zombie controller can 
continue updating ZK and/or sending LeaderAndIsrRequests to the brokers for a 
short period of time. When this happens, a broker may not have the most 
up-to-date leader and ISR information, which can cause subsequent ZK updates 
to fail when the ISR needs to be updated.
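
To illustrate why a stale cache leads to repeated failures, here is a minimal 
in-memory sketch of a version-checked (compare-and-set) update. It is not 
Kafka or ZooKeeper code: VersionedZnode, PartitionLeader, and their methods 
are hypothetical names, and the model assumes the leader only learns a new 
version from its own successful writes.

```python
class VersionedZnode:
    """In-memory stand-in for a znode that carries a version counter."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Reject the write if the caller's version is stale.
        if expected_version != self.version:
            return False
        self.data = data
        self.version += 1
        return True


class PartitionLeader:
    """Hypothetical leader that caches the znode version it last saw."""

    def __init__(self, znode):
        self.znode = znode
        self.cached_version = znode.version

    def shrink_isr(self, new_isr):
        ok = self.znode.conditional_set({"isr": new_isr}, self.cached_version)
        if ok:
            # Only a successful write refreshes the cached version.
            self.cached_version = self.znode.version
        return ok


state = VersionedZnode({"isr": [6, 5]})
leader = PartitionLeader(state)

# Another writer (for example, a controller) updates the znode, and the
# leader is never told about the new version.
state.conditional_set({"isr": [6, 5]}, state.version)

first_attempt = leader.shrink_isr([5])
second_attempt = leader.shrink_isr([5])
```

In this model both attempts are rejected, and every later attempt will be too, 
because nothing refreshes the leader's cached version. That matches the 
"skip updating ISR" symptom in the report, where only a restart helped.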

Fixing this issue requires us to change how we use the ZK API and may take 
some time. In the interim, one suggestion is to make sure ZK session expiration 
never happens. This can be achieved by making sure that (1) the ZK servers are 
performing well, (2) the brokers don't have long GC pauses, and (3) the ZK 
session expiration time is large enough.
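
As a rough aid for point (3), the sketch below estimates the session timeout a 
ZooKeeper server would grant and compares it with an observed GC pause. It 
relies on ZooKeeper's documented behavior of clamping the requested timeout 
between minSessionTimeout and maxSessionTimeout, which default to 2x and 20x 
tickTime. The function names and sample numbers are assumptions for 
illustration; on the broker, the requested value comes from 
zookeeper.session.timeout.ms.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_ms=None, max_ms=None):
    """Clamp the requested timeout to the server's allowed range."""
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


def pause_risks_expiry(pause_ms, session_timeout_ms):
    """Coarse check: a pause at least as long as the session timeout
    leaves no room for a heartbeat. Shorter pauses can still expire the
    session, depending on when the last heartbeat was sent."""
    return pause_ms >= session_timeout_ms


# Example values only: a 6 s requested timeout with a 2 s tickTime.
granted = negotiated_session_timeout_ms(6000, 2000)
capped = negotiated_session_timeout_ms(60000, 2000)
at_risk = pause_risks_expiry(7000, granted)
```

Note that asking for a very large timeout has no effect beyond the server's 
maximum, so the ZK server settings may need to change along with the broker 
setting.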

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of under-replicated partitions. The zookeeper 
> cluster recovered, but we continued to see a large number of 
> under-replicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This appeared for all of the topics on the affected brokers. Both brokers 
> only recovered after a restart. Our own investigation yielded nothing; I was 
> hoping you could shed some light on this issue. It may be related to 
> https://issues.apache.org/jira/browse/KAFKA-1382 , although we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
