[
https://issues.apache.org/jira/browse/KAFKA-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247287#comment-15247287
]
James Cheng commented on KAFKA-3577:
------------------------------------
The symptoms seem similar to https://issues.apache.org/jira/browse/KAFKA-3042.
Do you know if you are experiencing any high latency between your brokers and
zookeeper, maybe due to GC pauses or network or disk load?
> Partial cluster breakdown
> -------------------------
>
> Key: KAFKA-3577
> URL: https://issues.apache.org/jira/browse/KAFKA-3577
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.1
> Environment: Debian GNU/Linux 7.9 (wheezy)
> Reporter: Kim Christensen
>
> We run a cluster of 3 brokers and 3 zookeepers, but after we upgraded to
> 0.9.0.1 our cluster sometimes goes partially down, and we can't figure why. A
> full cluster restart fixed the problem.
> I've added a snippet of the logs on each broker below.
> Broker 4:
> {quote}
> [2016-04-18 05:58:26,390] INFO [Group Metadata Manager on Broker 4]: Removed
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:55,218] INFO Creating /controller (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,396] ERROR Session has expired while creating
> /controller (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,396] INFO Result of znode creation is: SESSIONEXPIRED
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,400] ERROR Error while electing or becoming leader on
> broker 4 (kafka.server.ZookeeperLeaderElector)
> org.I0Itec.zkclient.exception.ZkException:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired
> at
> org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
> at kafka.utils.ZKCheckedEphemeral.create(ZkUtils.scala:1090)
> at
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:81)
> at
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
> at
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> ... 9 more
> [2016-04-18 06:05:57,420] INFO Creating /controller (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,424] INFO Result of znode creation is: OK
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,425] INFO 4 successfully elected as leader
> (kafka.server.ZookeeperLeaderElector)
> [2016-04-18 06:05:57,885] INFO [ReplicaFetcherManager on broker 4] Removed
> fetcher for partitions
> [__consumer_offsets,32],[__consumer_offsets,44],[cdrrecords-errors,1],[cdrrecords,0],[__consumer_offsets,38],[__consumer_offsets,8],[events
> ,2],[__consumer_offsets,20],[__consumer_offsets,2],[__consumer_offsets,14],[__consumer_offsets,26]
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,892] INFO [ReplicaFetcherManager on broker 4] Removed
> fetcher for partitions
> [__consumer_offsets,35],[__consumer_offsets,23],[__consumer_offsets,47],[__consumer_offsets,11],[__consumer_offsets,5],[events-errors,2],[_
> _consumer_offsets,17],[__consumer_offsets,41],[__consumer_offsets,29]
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-17 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-23 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-29 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-35 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-41 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log events-errors-2 to offset 0.
> (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-5 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-11 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,896] INFO Truncating log __consumer_offsets-47 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,904] INFO [ReplicaFetcherManager on broker 4] Added
> fetcher for partitions List([[__consumer_offsets,17], initOffset 0 to broker
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,23],
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] ,
> [[__consumer_offsets,29], initOffset 0 to broker
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,35],
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] ,
> [[__consumer_offsets,41], initOffset 0 to broker
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[events-errors,2], initOffset
> 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] ,
> [[__consumer_offsets,5], initOffset 0 to broker
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,11],
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] ,
> [[__consumer_offsets,47], initOffset 0 to broker
> BrokerEndPoint(5,kafka304.prod.local,9092)] )
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,908] INFO [ReplicaFetcherThread-0-6], Shutting down
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:05:58,783] INFO re-registering broker info in ZK for broker 4
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,784] INFO Creating /brokers/ids/4 (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:58,799] INFO Result of znode creation is: OK
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:58,799] INFO Registered broker 4 at path /brokers/ids/4
> with addresses: PLAINTEXT -> EndPoint(kafka303.prod.local,9092,PLAINTEXT)
> (kafka.utils.ZkUtils)
> [2016-04-18 06:05:58,799] INFO done re-registering broker
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,799] INFO Subscribing to /brokers/topics path to watch
> for new topics (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,804] INFO New leader is 4
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:05:58,807] INFO New leader is 4
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Stopped
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Shutdown completed
> (kafka.server.ReplicaFetcherThread)
> {quote}
> Broker 5:
> {quote}
> [2016-04-18 05:57:05,926] INFO [Group Metadata Manager on Broker 5]: Removed
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:56,304] INFO Creating /controller (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:56,309] INFO Result of znode creation is: OK
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:56,312] INFO 5 successfully elected as leader
> (kafka.server.ZookeeperLeaderElector)
> [2016-04-18 06:05:58,545] INFO [ReplicaFetcherManager on broker 5] Removed
> fetcher for partitions
> [__consumer_offsets,17],[__consumer_offsets,23],[__consumer_offsets,29],[__consumer_offsets,35],[__consumer_offsets,41],[events-errors,2],[
> __consumer_offsets,5],[__consumer_offsets,11],[__consumer_offsets,47]
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,554] INFO [ReplicaFetcherManager on broker 5] Removed
> fetcher for partitions
> [__consumer_offsets,2],[__consumer_offsets,38],[__consumer_offsets,44],[events,2],[__consumer_offsets,32],[cdrrecords-errors,1],[__consumer
> _offsets,8],[__consumer_offsets,20],[cdrrecords,0],[__consumer_offsets,14],[__consumer_offsets,26]
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-32 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-44 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-errors-1 to offset
> 8. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-0 to offset 172892.
> (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-38 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-8 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log events-2 to offset 29954.
> (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-20 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-2 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-14 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-26 to offset
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,565] INFO [ReplicaFetcherManager on broker 5] Added
> fetcher for partitions List([[__consumer_offsets,32], initOffset 0 to broker
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,44],
> initOffset 0 to
> broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords-errors,1],
> initOffset 8 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ,
> [[cdrrecords,0], initOffset 172892 to broker
> BrokerEndPoint(4,kafka303.prod.local,9092)] ,
> [[__consumer_offsets,38], initOffset 0 to broker
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,8],
> initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ,
> [[events,2], initOffset 29954 to broker BrokerE
> ndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,20], initOffset 0
> to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ,
> [[__consumer_offsets,2], initOffset 0 to broker
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consu
> mer_offsets,14], initOffset 0 to broker
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,26],
> initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] )
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,568] INFO [ReplicaFetcherThread-0-6], Shutting down
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:05:58,613] INFO New leader is 4
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Stopped
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Shutdown completed
> (kafka.server.ReplicaFetcherThread)
> {quote}
> Broker 6:
> {quote}
> [2016-04-18 05:57:12,699] INFO [Group Metadata Manager on Broker 6]: Removed
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:55,281] INFO Partition [__consumer_offsets,47] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,656] INFO re-registering broker info in ZK for broker 6
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,691] INFO Creating /brokers/ids/6 (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:06:03,696] INFO Result of znode creation is: OK
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:06:03,696] INFO Partition [__consumer_offsets,47] on broker 6:
> Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,697] INFO Registered broker 6 at path /brokers/ids/6
> with addresses: PLAINTEXT -> EndPoint(kafka305.prod.local,9092,PLAINTEXT)
> (kafka.utils.ZkUtils)
> [2016-04-18 06:06:03,698] INFO done re-registering broker
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,698] INFO Partition [__consumer_offsets,26] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,26] from 5,6,4 to 6,4
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,699] INFO Subscribing to /brokers/topics path to watch
> for new topics (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,709] INFO Partition [__consumer_offsets,26] on broker 6:
> Cached zkVersion [88] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,821] INFO New leader is 4
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:12,390] INFO Partition [__consumer_offsets,47] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,406] INFO Partition [__consumer_offsets,47] on broker 6:
> Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,410] INFO Partition [__consumer_offsets,41] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,41] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,421] INFO Partition [__consumer_offsets,41] on broker 6:
> Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,422] INFO Partition [cdrrecords,0] on broker 6:
> Shrinking ISR for partition [cdrrecords,0] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,432] INFO Partition [cdrrecords,0] on broker 6: Cached
> zkVersion [170] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,432] INFO Partition [__consumer_offsets,20] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,20] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,20] on broker 6:
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,32] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,32] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,452] INFO Partition [__consumer_offsets,32] on broker 6:
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,452] INFO Partition [cdrrecords-errors,1] on broker 6:
> Shrinking ISR for partition [cdrrecords-errors,1] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,462] INFO Partition [cdrrecords-errors,1] on broker 6:
> Cached zkVersion [135] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,462] INFO Partition [__consumer_offsets,38] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,38] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,38] on broker 6:
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,5] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,5] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,5] on broker 6:
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,23] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,23] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,526] INFO Partition [__consumer_offsets,23] on broker 6:
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,526] INFO Partition [events-errors,2] on broker 6:
> Shrinking ISR for partition [events-errors,2] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,536] INFO Partition [events-errors,2] on broker 6:
> Cached zkVersion [167] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,537] INFO Partition [events,2] on broker 6: Shrinking
> ISR for partition [events,2] from 5,6,4 to 6 (kafka.cluster.Partition)
> [2016-04-18 06:06:12,546] INFO Partition [events,2] on broker 6: Cached
> zkVersion [237] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,547] INFO Partition [__consumer_offsets,8] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,8] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,8] on broker 6:
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,44] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,44] from 5,6,4 to 6
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,44] on broker 6:
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,29] on broker 6:
> Shrinking ISR for partition [__consumer_offsets,29] from 5,6,4 to 6
> (kafka.cluster.Partition)
> {quote}
> _and this goes on until the broker is restarted_
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)