[ https://issues.apache.org/jira/browse/KAFKA-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250531#comment-15250531 ]
Kim Christensen commented on KAFKA-3577: ---------------------------------------- Yeah, it seems a lot like the issue you referenced. We can't find any evidence that we're are experiencing high latency, but the current logging of network are disk load leaves a lot to be desired, so it is very likely the issue. We try to increase the zookeeper session timeout for now, and see if the issue is resolved, until our logging of possible latency issues are up to date. > Partial cluster breakdown > ------------------------- > > Key: KAFKA-3577 > URL: https://issues.apache.org/jira/browse/KAFKA-3577 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.9.0.1 > Environment: Debian GNU/Linux 7.9 (wheezy) > Reporter: Kim Christensen > > We run a cluster of 3 brokers and 3 zookeepers, but after we upgraded to > 0.9.0.1 our cluster sometimes goes partially down, and we can't figure why. A > full cluster restart fixed the problem. > I've added a snippet of the logs on each broker below. > Broker 4: > {quote} > [2016-04-18 05:58:26,390] INFO [Group Metadata Manager on Broker 4]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:55,218] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,396] ERROR Session has expired while creating > /controller (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,396] INFO Result of znode creation is: SESSIONEXPIRED > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,400] ERROR Error while electing or becoming leader on > broker 4 (kafka.server.ZookeeperLeaderElector) > org.I0Itec.zkclient.exception.ZkException: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at > org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68) > at kafka.utils.ZKCheckedEphemeral.create(ZkUtils.scala:1090) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:81) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > ... 9 more > [2016-04-18 06:05:57,420] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,424] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,425] INFO 4 successfully elected as leader > (kafka.server.ZookeeperLeaderElector) > [2016-04-18 06:05:57,885] INFO [ReplicaFetcherManager on broker 4] Removed > fetcher for partitions > [__consumer_offsets,32],[__consumer_offsets,44],[cdrrecords-errors,1],[cdrrecords,0],[__consumer_offsets,38],[__consumer_offsets,8],[events > ,2],[__consumer_offsets,20],[__consumer_offsets,2],[__consumer_offsets,14],[__consumer_offsets,26] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,892] INFO [ReplicaFetcherManager on broker 4] Removed > fetcher for partitions > [__consumer_offsets,35],[__consumer_offsets,23],[__consumer_offsets,47],[__consumer_offsets,11],[__consumer_offsets,5],[events-errors,2],[_ > _consumer_offsets,17],[__consumer_offsets,41],[__consumer_offsets,29] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-17 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-23 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-29 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-35 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-41 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log events-errors-2 to offset 0. > (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-5 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-11 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,896] INFO Truncating log __consumer_offsets-47 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,904] INFO [ReplicaFetcherManager on broker 4] Added > fetcher for partitions List([[__consumer_offsets,17], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,23], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,29], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,35], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,41], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[events-errors,2], initOffset > 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,5], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,11], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,47], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] ) > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,908] INFO [ReplicaFetcherThread-0-6], Shutting down > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:05:58,783] INFO re-registering broker info in ZK for broker 4 > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,784] INFO Creating /brokers/ids/4 (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:58,799] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:58,799] INFO Registered broker 4 at path /brokers/ids/4 > with addresses: PLAINTEXT -> EndPoint(kafka303.prod.local,9092,PLAINTEXT) > (kafka.utils.ZkUtils) > [2016-04-18 06:05:58,799] INFO done re-registering broker > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,799] INFO Subscribing to /brokers/topics path to watch > for new topics (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,804] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:05:58,807] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Stopped > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Shutdown completed > (kafka.server.ReplicaFetcherThread) > {quote} > Broker 5: > {quote} > [2016-04-18 05:57:05,926] INFO [Group Metadata Manager on Broker 5]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:56,304] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:56,309] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:56,312] INFO 5 successfully elected as leader > (kafka.server.ZookeeperLeaderElector) > [2016-04-18 06:05:58,545] INFO [ReplicaFetcherManager on broker 5] Removed > fetcher for partitions > [__consumer_offsets,17],[__consumer_offsets,23],[__consumer_offsets,29],[__consumer_offsets,35],[__consumer_offsets,41],[events-errors,2],[ > __consumer_offsets,5],[__consumer_offsets,11],[__consumer_offsets,47] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,554] INFO [ReplicaFetcherManager on broker 5] Removed > fetcher for partitions > [__consumer_offsets,2],[__consumer_offsets,38],[__consumer_offsets,44],[events,2],[__consumer_offsets,32],[cdrrecords-errors,1],[__consumer > _offsets,8],[__consumer_offsets,20],[cdrrecords,0],[__consumer_offsets,14],[__consumer_offsets,26] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-32 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-44 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-errors-1 to offset > 8. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-0 to offset 172892. > (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-38 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-8 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log events-2 to offset 29954. > (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-20 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-2 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-14 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-26 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,565] INFO [ReplicaFetcherManager on broker 5] Added > fetcher for partitions List([[__consumer_offsets,32], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,44], > initOffset 0 to > broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords-errors,1], > initOffset 8 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[cdrrecords,0], initOffset 172892 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[__consumer_offsets,38], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,8], > initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[events,2], initOffset 29954 to broker BrokerE > ndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,20], initOffset 0 > to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[__consumer_offsets,2], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consu > mer_offsets,14], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,26], > initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ) > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,568] INFO [ReplicaFetcherThread-0-6], Shutting down > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:05:58,613] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Stopped > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Shutdown completed > (kafka.server.ReplicaFetcherThread) > {quote} > Broker 6: > {quote} > [2016-04-18 05:57:12,699] INFO [Group Metadata Manager on Broker 6]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:55,281] INFO Partition [__consumer_offsets,47] on broker 6: > Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:03,656] INFO re-registering broker info in ZK for broker 6 > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,691] INFO Creating /brokers/ids/6 (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:06:03,696] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:06:03,696] INFO Partition [__consumer_offsets,47] on broker 6: > Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:03,697] INFO Registered broker 6 at path /brokers/ids/6 > with addresses: PLAINTEXT -> EndPoint(kafka305.prod.local,9092,PLAINTEXT) > (kafka.utils.ZkUtils) > [2016-04-18 06:06:03,698] INFO done re-registering broker > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,698] INFO Partition [__consumer_offsets,26] on broker 6: > Shrinking ISR for partition [__consumer_offsets,26] from 5,6,4 to 6,4 > (kafka.cluster.Partition) > [2016-04-18 06:06:03,699] INFO Subscribing to /brokers/topics path to watch > for new topics (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,709] INFO Partition [__consumer_offsets,26] on broker 6: > Cached zkVersion [88] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:03,821] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:12,390] INFO Partition [__consumer_offsets,47] on broker 6: > Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,406] INFO Partition [__consumer_offsets,47] on broker 6: > Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,410] INFO Partition [__consumer_offsets,41] on broker 6: > Shrinking ISR for partition [__consumer_offsets,41] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,421] INFO Partition [__consumer_offsets,41] on broker 6: > Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,422] INFO Partition [cdrrecords,0] on broker 6: > Shrinking ISR for partition [cdrrecords,0] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,432] INFO Partition [cdrrecords,0] on broker 6: Cached > zkVersion [170] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,432] INFO Partition [__consumer_offsets,20] on broker 6: > Shrinking ISR for partition [__consumer_offsets,20] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,20] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,32] on broker 6: > Shrinking ISR for partition [__consumer_offsets,32] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,452] INFO Partition [__consumer_offsets,32] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,452] INFO Partition [cdrrecords-errors,1] on broker 6: > Shrinking ISR for partition [cdrrecords-errors,1] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,462] INFO Partition [cdrrecords-errors,1] on broker 6: > Cached zkVersion [135] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,462] INFO Partition [__consumer_offsets,38] on broker 6: > Shrinking ISR for partition [__consumer_offsets,38] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,38] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,5] on broker 6: > Shrinking ISR for partition [__consumer_offsets,5] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,5] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,23] on broker 6: > Shrinking ISR for partition [__consumer_offsets,23] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,526] INFO Partition [__consumer_offsets,23] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,526] INFO Partition [events-errors,2] on broker 6: > Shrinking ISR for partition [events-errors,2] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,536] INFO Partition [events-errors,2] on broker 6: > Cached zkVersion [167] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,537] INFO Partition [events,2] on broker 6: Shrinking > ISR for partition [events,2] from 5,6,4 to 6 (kafka.cluster.Partition) > [2016-04-18 06:06:12,546] INFO Partition [events,2] on broker 6: Cached > zkVersion [237] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,547] INFO Partition [__consumer_offsets,8] on broker 6: > Shrinking ISR for partition [__consumer_offsets,8] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,8] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,44] on broker 6: > Shrinking ISR for partition [__consumer_offsets,44] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,44] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,29] on broker 6: > Shrinking ISR for partition [__consumer_offsets,29] from 5,6,4 to 6 > (kafka.cluster.Partition) > {quote} > _and this goes on until the broker is restarted_ -- This message was sent by Atlassian JIRA (v6.3.4#6332)