[
https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Braedon Vickers updated KAFKA-3984:
-----------------------------------
Affects Version/s: 0.10.1.1
> Broker doesn't retry reconnecting to an expired Zookeeper connection
> --------------------------------------------------------------------
>
> Key: KAFKA-3984
> URL: https://issues.apache.org/jira/browse/KAFKA-3984
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.1, 0.10.1.1
> Reporter: Braedon Vickers
>
> We've been having issues with the network connectivity of our Kafka cluster,
> and this seems to be triggering an issue where the brokers stop trying to
> reconnect to Zookeeper, leaving us with a broken cluster even when the
> network has recovered.
> When network issues begin we see {{java.net.NoRouteToHostException}}
> exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to
> re-establish the connection. If the network issue resolves itself while we
> are only getting these errors the broker seems to reconnect fine.
> However, a lot of the time we end up with a message like this:
> {code}
> [2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
> ...
> Caused by: java.net.UnknownHostException: <zookeeper host>
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> ...
> {code}
> (apologies for the partial stack traces - I'm having to try and reconstruct
> them from a less than ideal centralised logging setup.)
> If this happens, the broker stops trying to reconnect to Zookeeper, and we
> have to restart it.
> It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state
> isn't {{Expired}} it will keep retrying the connection, and will recover OK
> when the network is back. However, once it changes to {{Expired}} (not
> entirely sure how that happens - based on the session timeout perhaps?)
> zkclient closes the existing client and attempts to create a new one. If the
> network is still down, the client constructor throws a
> {{java.net.UnknownHostException}} and zkclient calls
> {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, which logs a
> "Fatal" error and does nothing else.
> It seems like some form of retry needs to happen here, or the broker is stuck
> with no Zookeeper connection indefinitely.
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to
> kill the JVM, but that was removed in
> https://issues.apache.org/jira/browse/KAFKA-2405. Killing the JVM would be
> better than doing nothing, as then your init system could restart it,
> allowing it to recover once the network was back.
> Our cluster is running 0.9.0.1, so I'm not sure if it affects 0.10.0.0 as well.
> However, it seems likely, as there don't seem to be any code changes in
> kafka or zkclient that would affect this behaviour.
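
One possible shape for the retry suggested above, as a minimal sketch and not Kafka's or zkclient's actual code. It assumes the session re-establishment step can be wrapped in a callback that throws {{java.net.UnknownHostException}} while the network is down. The names {{ReconnectSketch}}, {{Connector}} and {{retryWithBackoff}} are hypothetical and exist only for illustration.

```java
import java.net.UnknownHostException;

public class ReconnectSketch {

    /** Hypothetical stand-in for "create a new ZooKeeper client"; not a zkclient API. */
    @FunctionalInterface
    public interface Connector<T> {
        T connect() throws UnknownHostException;
    }

    /**
     * Calls connector.connect() until it succeeds or maxAttempts is used up,
     * sleeping between attempts with a delay that doubles up to maxDelayMs.
     * Throws IllegalStateException if every attempt fails, so the caller can
     * decide what to do next (for example, exit and let the init system restart).
     */
    public static <T> T retryWithBackoff(Connector<T> connector,
                                         int maxAttempts,
                                         long initialDelayMs,
                                         long maxDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connector.connect();
            } catch (UnknownHostException e) {
                // Name resolution failed, which is the failure mode in the report.
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while retrying", ie);
                }
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw new IllegalStateException(
                "Could not connect after " + maxAttempts + " attempts", last);
    }
}
```

In this sketch, giving up after a bounded number of attempts raises an exception instead of silently returning, so a caller that cannot recover could still fall back to terminating the process.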