[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967705#comment-15967705
 ] 

Jun Rao commented on KAFKA-2729:
--------------------------------

Thanks for the additional info. In both [~Ronghua Lin]'s and [~allenzhuyi]'s
cases, it seems that a ZK session expiration had happened. As I mentioned
earlier in the jira, there is a known issue reported in KAFKA-3083: when the
controller's ZK session expires and it loses its controllership, it's possible
for this zombie controller to continue updating ZK and/or sending
LeaderAndIsrRequests to the brokers for a short period of time. When this
happens, the broker may not have the most up-to-date information about leader
and isr, which can lead to a subsequent ZK failure when the isr needs to be
updated.
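
To illustrate why stale information leads to the "Cached zkVersion not equal to
that in zookeeper" message, here is a minimal sketch of a versioned conditional
write. It is a toy model in Python, not Kafka's or ZooKeeper's actual code;
VersionedNode, BadVersionError and try_shrink_isr are hypothetical names made
up for the example.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedNode:
    """Toy stand-in for a ZK znode: every successful write bumps the version."""

    def __init__(self, data, version=0):
        self.data = data
        self.version = version

    def set_data(self, data, expected_version):
        # Conditional write: succeeds only if the caller saw the latest version.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, found {self.version}")
        self.data = data
        self.version += 1
        return self.version


def try_shrink_isr(node, cached_version, new_isr):
    """Return the new version on success, or None if the update was skipped."""
    try:
        return node.set_data({"isr": new_isr}, cached_version)
    except BadVersionError:
        return None


# Partition state as the broker last saw it: isr 6,5 at version 66.
state = VersionedNode({"isr": [6, 5]}, version=66)
broker_cached_version = 66

# A write the broker never hears about (for example, from a zombie controller).
state.set_data({"isr": [6, 5]}, expected_version=66)  # version is now 67

# The broker's shrink attempt still carries version 66, so it is rejected.
result = try_shrink_isr(state, broker_cached_version, [5])
print(result)         # None
print(state.version)  # 67
```

In this model the broker keeps failing until something refreshes its cached
version, which is consistent with the reported brokers only recovering after a
restart.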

It may take some time to have this issue fixed. In the interim, the workaround
for this issue is to make sure ZK session expiration never happens. The first
thing is to figure out what's causing the ZK session to expire. Two common
causes are (1) long broker GC and (2) network glitches. For (1), one needs to
tune the GC in the broker properly. For (2), one can look at the reported time
that the ZK client can't hear from the ZK server and increase the ZK session
expiration time accordingly.
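
As a rough sketch of the arithmetic for (2), assuming the longest observed
silence between the ZK client and server has been collected from the broker
logs: the gap values, the 1.5x safety factor and the 6000 ms floor below are
illustrative assumptions, not recommendations. The broker property being set
is zookeeper.session.timeout.ms.

```python
def suggest_session_timeout_ms(observed_gaps_ms, safety_factor=1.5,
                               floor_ms=6000):
    """Pick a session timeout comfortably above the longest observed gap.

    safety_factor and floor_ms are assumptions chosen for illustration.
    """
    longest = max(observed_gaps_ms, default=0)
    return max(floor_ms, int(longest * safety_factor))


# Hypothetical durations (ms) during which the client heard nothing from ZK.
gaps = [4200, 7800, 12500]

timeout = suggest_session_timeout_ms(gaps)
print(f"zookeeper.session.timeout.ms={timeout}")
# prints "zookeeper.session.timeout.ms=18750"
```

Note that the ZK server bounds the negotiated session timeout by its
maxSessionTimeout setting (20 times tickTime by default), so a larger value on
the broker side may also require a change on the ZK side.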

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The
> zookeeper cluster recovered; however, we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing
> this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This was logged for all of the topics on the affected brokers. Both brokers
> only recovered after a restart. Our own investigation yielded nothing; I was
> hoping you could shed some light on this issue. It is possibly related to
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using
> 0.8.2.1.



