[ 
https://issues.apache.org/jira/browse/STORM-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776514#comment-16776514
 ] 

Ramzi Jamal commented on STORM-2415:
------------------------------------

[~Srdo] The main issue is related to this ZooKeeper issue: 
https://issues.apache.org/jira/browse/ZOOKEEPER-2184, which addresses the 
fact that the ZooKeeper client resolves the ZK nodes at construction time 
and never re-attempts to resolve them later.
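
As a rough illustration of the difference (a minimal sketch, not the actual 
ZooKeeper code): the class name, the Resolver interface and both method names 
below are hypothetical, introduced only so the behaviour can be shown without 
real DNS. The eager variant fails as soon as one hostname is unresolvable, as 
in the stack trace quoted below; the tolerant variant skips that host and can 
simply be called again later.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in ZooKeeper, Curator or Storm.
final class HostResolutionSketch {

    /** Assumed abstraction over DNS so the logic can be exercised with a fake. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /** Eager: resolve every host up front; one bad hostname aborts everything. */
    static List<InetAddress> resolveEagerly(List<String> hosts, Resolver resolver)
            throws UnknownHostException {
        List<InetAddress> resolved = new ArrayList<>();
        for (String host : hosts) {
            for (InetAddress address : resolver.resolve(host)) {
                resolved.add(address);
            }
        }
        return resolved;
    }

    /** Tolerant: skip hosts that do not resolve right now; call again to retry. */
    static List<InetAddress> resolveTolerantly(List<String> hosts, Resolver resolver) {
        List<InetAddress> resolved = new ArrayList<>();
        for (String host : hosts) {
            try {
                for (InetAddress address : resolver.resolve(host)) {
                    resolved.add(address);
                }
            } catch (UnknownHostException e) {
                // Host is currently unresolvable (e.g. the VM is gone); keep the others.
            }
        }
        return resolved;
    }
}
```

With real DNS the resolver could be passed as InetAddress::getAllByName. With 
three configured hosts of which one no longer resolves, the eager variant 
throws UnknownHostException while the tolerant one still returns the 
addresses of the two remaining hosts.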

Fixed for ZooKeeper 3.4.13 through this commit: 
[https://github.com/apache/zookeeper/commit/2e26c8836edc800c60b204a1d3da0285edb415d6#diff-25d902c24283ab8cfbac54dfa101ad31]

Fixed for the current ZooKeeper master branch (which is in line for 3.5.5) 
through this commit: 
[https://github.com/apache/zookeeper/commit/0a311873deb1847703c9b62716c626ce43d4ba48]

> Storm fails to properly handle Zookeeper hosts going down
> ---------------------------------------------------------
>
>                 Key: STORM-2415
>                 URL: https://issues.apache.org/jira/browse/STORM-2415
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.3
>         Environment: All
>            Reporter: Anthony Milbourne
>            Priority: Major
>
> We run a storm cluster (v.1.0.3) on AWS and have 3 Zookeepers supporting it. 
> Because AWS sometimes terminates VMs, we sometimes lose a Zookeeper instance. 
> When this happens, the hostname cannot be resolved for that zookeeper 
> instance as AWS has taken the VM away. We noticed that in this case storm 
> fails to connect to zookeeper – even though there are still 2 Zookeeper 
> instances running. It fails with an exception something like:
> {noformat}
> java.net.UnknownHostException: zookeeper3
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1280) 
>   at java.net.InetAddress.getAllByName(InetAddress.java:1192) 
>   at java.net.InetAddress.getAllByName(InetAddress.java:1126) 
>   at org.apache.storm.shade.org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>   at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>   at org.apache.storm.shade.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
>   at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
>   at org.apache.storm.shade.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
>   at org.apache.storm.shade.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
>   at org.apache.storm.shade.org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
>   at org.apache.storm.shade.org.apache.curator.ConnectionState.start(ConnectionState.java:103)
>   at org.apache.storm.shade.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:190)
>   at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
>   at org.apache.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:86)
>   at clojure.lang.RestFn.invoke(RestFn.java:494)
>   at org.apache.storm.cluster_state.zookeeper_state_factory$_mkState.invoke(zookeeper_state_factory.clj:28)
>   at org.apache.storm.cluster_state.zookeeper_state_factory.mkState(Unknown Source)
>   <SNIP REST OF STACKTRACE>
> {noformat}
> Having done some research it looks like this error is caused by a bug in the 
> Zookeeper client library. There is an issue for it here:
> [https://issues.apache.org/jira/browse/ZOOKEEPER-1576]
> This issue has been resolved in the version 3.5.x branch of Zookeeper. 
> However, after 2.5 years and 3 releases the 3.5.x branch of Zookeeper is 
> still in Alpha.
> Despite the fact that it is in alpha, there is a branch of Curator (v.3.x.x) 
> that uses it, but Storm uses Curator version 2.x.x – possibly because it 
> doesn’t rely on alpha code. So the bug is still unpatched in Storm.
> I realise that an upgrade to alpha code may be too much of a risk, but this 
> problem is a serious issue for those running Storm in a containerised or 
> cloud environment - so perhaps it may be worth considering?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)