Anthony Milbourne created STORM-2415:
----------------------------------------

             Summary: Storm fails to properly handle Zookeeper hosts going down
                 Key: STORM-2415
                 URL: https://issues.apache.org/jira/browse/STORM-2415
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 1.0.3
         Environment: All
            Reporter: Anthony Milbourne


We run a Storm cluster (v1.0.3) on AWS, backed by 3 Zookeeper instances. 
Because AWS sometimes terminates VMs, we occasionally lose a Zookeeper instance. 
When this happens, the hostname of that instance can no longer be resolved, 
because AWS has taken the VM away. We noticed that in this case Storm fails to 
connect to Zookeeper, even though 2 Zookeeper instances are still 
running. It fails with an exception like this:
{noformat}
java.net.UnknownHostException: zookeeper3
  at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
  at java.net.InetAddress.getAllByName(InetAddress.java:1192)
  at java.net.InetAddress.getAllByName(InetAddress.java:1126)
  at org.apache.storm.shade.org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
  at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
  at org.apache.storm.shade.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
  at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
  at org.apache.storm.shade.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
  at org.apache.storm.shade.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
  at org.apache.storm.shade.org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
  at org.apache.storm.shade.org.apache.curator.ConnectionState.start(ConnectionState.java:103)
  at org.apache.storm.shade.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:190)
  at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
  at org.apache.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:86)
  at clojure.lang.RestFn.invoke(RestFn.java:494)
  at org.apache.storm.cluster_state.zookeeper_state_factory$_mkState.invoke(zookeeper_state_factory.clj:28)
  at org.apache.storm.cluster_state.zookeeper_state_factory.mkState(Unknown Source)
  <SNIP REST OF STACKTRACE>
{noformat}
After some research, it looks like this error is caused by a bug in the 
Zookeeper client library, which is tracked here:
[https://issues.apache.org/jira/browse/ZOOKEEPER-1576]
That issue has been resolved in the 3.5.x branch of Zookeeper. However, 
after 2.5 years and 3 releases, the 3.5.x branch is still in alpha.
Despite that, there is a branch of Curator (v3.x.x) that uses it, but Storm 
uses Curator 2.x.x, possibly because that version does not rely on alpha 
code. So the bug is still unpatched in Storm.
I realise that upgrading to alpha code may be too much of a risk, but this 
problem is serious for anyone running Storm in a containerised or cloud 
environment, so perhaps it is worth considering?
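
To illustrate the behaviour I would expect, here is a rough sketch (not the actual Zookeeper patch, and not Storm or Curator code). The stack trace shows the UnknownHostException from InetAddress.getAllByName escaping the StaticHostProvider constructor, so one unresolvable host aborts the whole connection attempt. The sketch resolves each host separately and skips the ones that fail. The class name TolerantHostResolver, the Lookup interface and the default port handling are all my own assumptions for illustration.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: resolves a connect string while tolerating dead hosts. */
public class TolerantHostResolver {

    /** Hypothetical lookup hook, so the logic can be exercised without real DNS. */
    public interface Lookup {
        InetAddress[] allByName(String host) throws UnknownHostException;
    }

    /**
     * Resolves every "host:port" entry of a comma-separated connect string.
     * Entries whose hostname cannot be resolved are skipped instead of
     * failing the whole call. Simplification: IPv6 literals and chroot
     * suffixes are not handled.
     */
    public static List<InetSocketAddress> resolveAvailable(String connectString, Lookup lookup) {
        List<InetSocketAddress> resolved = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String[] hostPort = entry.trim().split(":");
            String host = hostPort[0];
            // Assumption: fall back to the conventional client port 2181.
            int port = hostPort.length > 1 ? Integer.parseInt(hostPort[1]) : 2181;
            try {
                for (InetAddress address : lookup.allByName(host)) {
                    resolved.add(new InetSocketAddress(address, port));
                }
            } catch (UnknownHostException e) {
                // One missing host should not prevent use of the others.
                System.err.println("Skipping unresolvable Zookeeper host: " + host);
            }
        }
        if (resolved.isEmpty()) {
            // Only give up when nothing at all could be resolved.
            throw new IllegalArgumentException(
                    "No Zookeeper host in '" + connectString + "' could be resolved");
        }
        return resolved;
    }
}
```

With real DNS this would be called as TolerantHostResolver.resolveAvailable("zookeeper1:2181,zookeeper2:2181,zookeeper3:2181", InetAddress::getAllByName). If zookeeper3 no longer resolves, the result would still contain the addresses of the other two hosts.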



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
