[
https://issues.apache.org/jira/browse/STORM-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776514#comment-16776514
]
Ramzi Jamal commented on STORM-2415:
------------------------------------
[~Srdo] The main cause is the ZooKeeper issue
https://issues.apache.org/jira/browse/ZOOKEEPER-2184, which addresses the
fact that the ZooKeeper client resolves the ZK nodes at construction time
and never re-attempts to resolve them later.
Fixed on the ZooKeeper 3.4 branch (for 3.4.13) by this commit:
[https://github.com/apache/zookeeper/commit/2e26c8836edc800c60b204a1d3da0285edb415d6#diff-25d902c24283ab8cfbac54dfa101ad31]
Fixed on the current ZooKeeper master branch (which is in line for 3.5.5) by
this commit:
[https://github.com/apache/zookeeper/commit/0a311873deb1847703c9b62716c626ce43d4ba48]
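To illustrate the idea of resolving hosts on each connection attempt rather than once at construction time, here is a minimal sketch. It is not the code from the commits above and not ZooKeeper's StaticHostProvider; the class name LazyHostResolver, its Resolver interface and the next() method are all hypothetical, made up for this example. It assumes "host:port" strings as input and simply skips any host that cannot currently be resolved.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not part of ZooKeeper, Curator or Storm. */
class LazyHostResolver {

    /** Pluggable lookup so the behaviour can be exercised without real DNS. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    private final List<InetSocketAddress> servers = new ArrayList<>();
    private final Resolver resolver;
    private int index = -1;

    LazyHostResolver(List<String> hostPorts, Resolver resolver) {
        // Only parse "host:port" here; no DNS lookup happens in the constructor,
        // so a host that is temporarily unresolvable cannot fail construction.
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            servers.add(InetSocketAddress.createUnresolved(host, port));
        }
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one host:port is required");
        }
        this.resolver = resolver;
    }

    /**
     * Returns the next server in round-robin order that resolves right now.
     * Every call performs a fresh lookup, so a host that comes back under a
     * new IP address is picked up on a later attempt.
     */
    InetSocketAddress next() throws UnknownHostException {
        UnknownHostException last = null;
        for (int i = 0; i < servers.size(); i++) {
            index = (index + 1) % servers.size();
            InetSocketAddress candidate = servers.get(index);
            try {
                InetAddress[] addresses = resolver.resolve(candidate.getHostString());
                if (addresses.length > 0) {
                    return new InetSocketAddress(addresses[0], candidate.getPort());
                }
            } catch (UnknownHostException e) {
                last = e; // skip this host and try the next one
            }
        }
        // Only fail when none of the configured hosts can be resolved.
        throw last != null ? last : new UnknownHostException("no resolvable host");
    }
}
```

With real DNS the resolver argument could be InetAddress::getAllByName. In the scenario from the issue description (zookeeper3 gone, two instances still up), next() in this sketch would keep returning the two remaining hosts instead of throwing from the constructor.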
> Storm fails to properly handle Zookeeper hosts going down
> ---------------------------------------------------------
>
> Key: STORM-2415
> URL: https://issues.apache.org/jira/browse/STORM-2415
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.3
> Environment: All
> Reporter: Anthony Milbourne
> Priority: Major
>
> We run a storm cluster (v.1.0.3) on AWS and have 3 Zookeepers supporting it.
> Because AWS sometimes terminates VMs, we sometimes lose a Zookeeper instance.
> When this happens, the hostname cannot be resolved for that zookeeper
> instance as AWS has taken the VM away. We noticed that in this case storm
> fails to connect to zookeeper – even though there are still 2 Zookeeper
> instances running. It fails with an exception something like:
> {noformat}
> java.net.UnknownHostException: zookeeper3
> at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at org.apache.storm.shade.org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at org.apache.storm.shade.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
> at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
> at org.apache.storm.shade.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
> at org.apache.storm.shade.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
> at org.apache.storm.shade.org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
> at org.apache.storm.shade.org.apache.curator.ConnectionState.start(ConnectionState.java:103)
> at org.apache.storm.shade.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:190)
> at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
> at org.apache.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:86)
> at clojure.lang.RestFn.invoke(RestFn.java:494)
> at org.apache.storm.cluster_state.zookeeper_state_factory$_mkState.invoke(zookeeper_state_factory.clj:28)
> at org.apache.storm.cluster_state.zookeeper_state_factory.mkState(Unknown Source)
> <SNIP REST OF STACKTRACE>
> {noformat}
> Having done some research it looks like this error is caused by a bug in the
> Zookeeper client library. There is an issue for it here:
> [https://issues.apache.org/jira/browse/ZOOKEEPER-1576]
> This issue has been resolved in the version 3.5.x branch of Zookeeper.
> However, after 2.5 years and 3 releases the 3.5.x branch of Zookeeper is
> still in Alpha.
> Despite the fact that it is in alpha, there is a branch of Curator (v.3.x.x)
> that uses it, but Storm uses Curator version 2.x.x – possibly because it
> doesn’t rely on alpha code. So the bug is still unpatched in Storm.
> I realise that an upgrade to alpha code may be too much of a risk, but this
> problem is a serious issue for those running Storm in a containerised or
> cloud environment - so perhaps it may be worth considering?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)