[ 
https://issues.apache.org/jira/browse/KAFKA-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534743#comment-14534743
 ] 

Igor Maravić commented on KAFKA-2169:
-------------------------------------

We at Spotify have just been hit by a bug related to ZkClient. It brought down 
a whole Kafka cluster.

Long story short, we restarted a TOR (top-of-rack) switch, and all of the 
brokers in the cluster lost connectivity to the ZooKeeper cluster, which lives 
in another rack. Because of that, all of the ZooKeeper sessions expired.

Everything would have been fine if we hadn't also lost the hostname-to-IP-address 
mapping for some reason. Since we did lose that mapping, we got an 
UnknownHostException when the broker tried to reconnect. This exception was 
swallowed, and we never reconnected to ZooKeeper, which in turn made our 
cluster of brokers useless.

If we had a "handleSessionEstablishmentError" callback, the exception could be 
caught and we could simply kill the KafkaServer process and restart it 
cleanly, since this kind of exception is fatal for the Kafka cluster.
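
To illustrate the idea, here is a minimal, self-contained sketch of how such a 
callback could surface the failure instead of swallowing it. It does not use 
the real zkclient classes: SessionListener, Connector and reconnect are 
hypothetical stand-ins invented for this example, and only the callback name 
is taken from the proposal above. A real broker would presumably trigger a 
shutdown inside the callback; the sketch only records the error.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicReference;

public class SessionErrorSketch {

    /** Hypothetical listener; only the error callback name comes from the proposal. */
    interface SessionListener {
        void handleNewSession();
        void handleSessionEstablishmentError(Throwable error);
    }

    /** Hypothetical stand-in for the step that opens a new ZooKeeper connection. */
    interface Connector {
        void connect() throws Exception;
    }

    /**
     * Tries to establish a new session after expiry. Any failure is handed
     * to the listener rather than being lost on the event thread.
     */
    static void reconnect(Connector connector, SessionListener listener) {
        try {
            connector.connect();
            listener.handleNewSession();
        } catch (Exception e) {
            listener.handleSessionEstablishmentError(e);
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        SessionListener listener = new SessionListener() {
            @Override
            public void handleNewSession() {
                System.out.println("session re-established");
            }

            @Override
            public void handleSessionEstablishmentError(Throwable error) {
                // Assumption: a real broker would shut itself down here so that
                // a supervisor can restart it cleanly. The sketch just records it.
                fatal.set(error);
            }
        };

        // Simulate the DNS failure seen in the log during reconnect.
        reconnect(() -> {
            throw new UnknownHostException("zookeeper1.spotify.net");
        }, listener);

        System.out.println("fatal error reported: " + fatal.get());
    }
}
```

The point of the design is that the reconnect path always ends in exactly one 
of the two callbacks, so the process either has a working session or knows 
that it does not.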

{code}
2015-05-05T12:49:01.709+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO  
zookeeper.ZooKeeper  - Initiating client connection, 
connectString=zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@7303d690
2015-05-05T12:49:01.711+00:00 127.0.0.1 apache-kafka[main-EventThread] ERROR 
zookeeper.ClientCnxn  - Error while calling watcher
2015-05-05T12:49:01.711+00:00 127.0.0.1 java.lang.RuntimeException: Exception 
while restarting zk client
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2015-05-05T12:49:01.711+00:00 127.0.0.1 Caused by: 
org.I0Itec.zkclient.exception.ZkException: Unable to connect to 
zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:939)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
2015-05-05T12:49:01.711+00:00 127.0.0.1 ... 3 more
2015-05-05T12:49:01.712+00:00 127.0.0.1 Caused by: 
java.net.UnknownHostException: zookeeper1.spotify.net: Name or service not known
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.InetAddress.getAllByName0(InetAddress.java:1246)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.InetAddress.getAllByName(InetAddress.java:1162)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
java.net.InetAddress.getAllByName(InetAddress.java:1098)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at 
org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
2015-05-05T12:49:01.713+00:00 127.0.0.1 ... 5 more
2015-05-05T12:49:01.713+00:00 127.0.0.1 
apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local]
 ERROR zkclient.ZkEventThread  - Error handling event ZkEvent[Children of 
/config/changes changed sent to 
kafka.server.TopicConfigManager$ConfigChangeListener$@17638f6]
2015-05-05T12:49:01.713+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO  
zookeeper.ClientCnxn  - EventThread shut down
2015-05-05T12:49:01.714+00:00 127.0.0.1 
apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local]
 ERROR zkclient.ZkEventThread  - Error handling event ZkEvent[Data of 
/controller changed sent to 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener@18360394]
2015-05-05T12:49:01.714+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:544)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at 
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
{code}

> Upgrade to zkclient-0.5
> -----------------------
>
>                 Key: KAFKA-2169
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2169
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.0
>            Reporter: Neha Narkhede
>            Assignee: Parth Brahmbhatt
>            Priority: Critical
>
> zkclient-0.5 is released 
> http://mvnrepository.com/artifact/com.101tec/zkclient/0.5 and has the fix for 
> KAFKA-824



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
