[ https://issues.apache.org/jira/browse/STORM-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768744#comment-16768744 ]
Ramzi Jamal commented on STORM-2415:
------------------------------------
I appreciate you reporting this issue; we are facing it as well.
Our Storm and ZooKeeper clusters are containerized. After one of the ZooKeeper
nodes restarted with a new IP address, we noticed that Nimbus still attempts
to connect to the old IP address of that ZK node.
We traced the issue to the ZooKeeper client 3.4.6 used by Storm (currently
shaded within the storm-core jar).
We have seen the issue in Storm 1.0.2, 1.0.3 and 1.0.6 and will be testing with
1.2.2. We expect it to fail as well, since it also uses ZK 3.4.6.
It is clearly an issue that we would appreciate help resolving. I wonder if we
can consider updating the zookeeper dependency in storm to 3.4.13 which
addressed that issue.
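
To make the stale-address problem concrete, here is a minimal Java sketch of
the behaviour we would like from a client: look the hostname up again on every
connection attempt instead of keeping the address resolved at start-up. This is
only an illustration under my own assumptions. ReResolvingEndpoint and its
Resolver hook are hypothetical names, not ZooKeeper, Curator or Storm APIs, and
this is not the actual change made in any ZooKeeper release.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/**
 * Illustrative only: an endpoint that re-resolves its hostname on every
 * connection attempt. All names here are hypothetical.
 */
class ReResolvingEndpoint {

    /** Hypothetical lookup hook, so the behaviour can be exercised without real DNS. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final int port;
    private final Resolver resolver;

    ReResolvingEndpoint(String host, int port, Resolver resolver) {
        // Keep the hostname, not a resolved address, so nothing goes stale here.
        this.host = host;
        this.port = port;
        this.resolver = resolver;
    }

    /** Resolves the hostname again on each call, so a changed IP is picked up. */
    InetSocketAddress nextAddress() throws UnknownHostException {
        return new InetSocketAddress(resolver.resolve(host), port);
    }

    public static void main(String[] args) throws Exception {
        // The default hook delegates to the JVM lookup; 2181 is the usual ZK client port.
        ReResolvingEndpoint endpoint =
                new ReResolvingEndpoint("localhost", 2181, InetAddress::getByName);
        System.out.println(endpoint.nextAddress());
    }
}
```

Note that the JVM keeps its own DNS cache (controlled by the
networkaddress.cache.ttl security property), so even a client written this way
only sees a new address once that cache entry has expired.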
Many thanks
> Storm fails to properly handle Zookeeper hosts going down
> ---------------------------------------------------------
>
> Key: STORM-2415
> URL: https://issues.apache.org/jira/browse/STORM-2415
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.3
> Environment: All
> Reporter: Anthony Milbourne
> Priority: Major
>
> We run a storm cluster (v.1.0.3) on AWS and have 3 Zookeepers supporting it.
> Because AWS sometimes terminates VMs, we sometimes lose a Zookeeper instance.
> When this happens, the hostname cannot be resolved for that zookeeper
> instance as AWS has taken the VM away. We noticed that in this case storm
> fails to connect to zookeeper – even though there are still 2 Zookeeper
> instances running. It fails with an exception something like:
> {noformat}
> java.net.UnknownHostException: zookeeper3
> at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at org.apache.storm.shade.org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> at org.apache.storm.shade.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
> at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
> at org.apache.storm.shade.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
> at org.apache.storm.shade.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
> at org.apache.storm.shade.org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
> at org.apache.storm.shade.org.apache.curator.ConnectionState.start(ConnectionState.java:103)
> at org.apache.storm.shade.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:190)
> at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
> at org.apache.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:86)
> at clojure.lang.RestFn.invoke(RestFn.java:494)
> at org.apache.storm.cluster_state.zookeeper_state_factory$_mkState.invoke(zookeeper_state_factory.clj:28)
> at org.apache.storm.cluster_state.zookeeper_state_factory.mkState(Unknown Source)
> <SNIP REST OF STACKTRACE>
> {noformat}
> Having done some research, it looks like this error is caused by a bug in the
> Zookeeper client library. There is an issue for it here:
> [https://issues.apache.org/jira/browse/ZOOKEEPER-1576]
> That issue has been resolved in the 3.5.x branch of Zookeeper.
> However, after 2.5 years and 3 releases, the 3.5.x branch of Zookeeper is
> still in alpha.
> Despite the fact that it is in alpha, there is a branch of Curator (v.3.x.x)
> that uses it, but Storm uses Curator version 2.x.x – possibly because it
> doesn’t rely on alpha code. So the bug is still unpatched in Storm.
> I realise that an upgrade to alpha code may be too much of a risk, but this
> problem is a serious issue for those running Storm in a containerised or
> cloud environment - so perhaps it may be worth considering?
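
For illustration, the UnknownHostException in the stack trace above comes from
resolving every configured host up front, so one unresolvable name fails the
whole client. A minimal Java sketch of the more tolerant approach is to skip
hosts that cannot be resolved and fail only if none can. This shows the general
idea only and is not the actual ZOOKEEPER-1576 patch. TolerantHostResolver and
its Resolver hook are hypothetical names, and the parsing assumes a plain
"host:port,host:port" connect string with no chroot suffix or IPv6 literals.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: resolves a connect string while tolerating dead hostnames. */
class TolerantHostResolver {

    /** Hypothetical lookup hook, so the logic can be exercised without real DNS. */
    interface Resolver {
        InetAddress[] resolveAll(String host) throws UnknownHostException;
    }

    /**
     * Resolves each "host:port" entry, skipping entries whose hostname cannot
     * be resolved. Throws only when no entry at all could be resolved.
     */
    static List<InetSocketAddress> resolveReachable(String connectString, Resolver resolver)
            throws UnknownHostException {
        List<InetSocketAddress> resolved = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String[] hostPort = entry.trim().split(":");
            String host = hostPort[0];
            // Assumption: fall back to 2181, the usual ZooKeeper client port.
            int port = hostPort.length > 1 ? Integer.parseInt(hostPort[1]) : 2181;
            try {
                for (InetAddress address : resolver.resolveAll(host)) {
                    resolved.add(new InetSocketAddress(address, port));
                }
            } catch (UnknownHostException e) {
                // One missing host should not prevent use of the remaining ones.
                System.err.println("Skipping unresolvable host: " + host);
            }
        }
        if (resolved.isEmpty()) {
            throw new UnknownHostException(
                    "No host in '" + connectString + "' could be resolved");
        }
        return resolved;
    }

    public static void main(String[] args) throws Exception {
        // InetAddress.getAllByName is the same JDK lookup seen in the stack trace.
        System.out.println(resolveReachable("localhost:2181", InetAddress::getAllByName));
    }
}
```

With three configured hosts of which one no longer resolves, this returns the
addresses of the two remaining hosts instead of throwing.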
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)