[
https://issues.apache.org/jira/browse/SOLR-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Endika Posadas updated SOLR-5945:
---------------------------------
Attachment: (was: retryConnectingToZookeeper.patch)
> Add retry for zookeeper reconnect failure
> -----------------------------------------
>
> Key: SOLR-5945
> URL: https://issues.apache.org/jira/browse/SOLR-5945
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Affects Versions: 4.7
> Reporter: Jessica Cheng Mallet
> Priority: Major
> Labels: solrcloud, zookeeper
>
> We had a network issue during which we temporarily lost connectivity and DNS.
> The ZooKeeper client properly triggered the watcher. However, when trying to
> reconnect, the following exception was thrown:
> 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1063)
>     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>     at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
>     at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
>     at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> I looked at the code, and it seems that there are no further retries to
> connect to ZooKeeper once this exception is thrown, so the node is left in a
> bad state and will not recover on its own. (Please correct me if I'm reading
> this wrong.) Thinking about it, this is probably fair, since normally you
> wouldn't expect retries to fix an "unknown host" issue (even though in our
> case it would have), but I'm wondering what we should do to handle this
> situation if it happens again in the future.
> Any advice is appreciated.
> From Mark Miller:
> We don’t currently retry, but I don’t think it would hurt much if we did - at
> least briefly.
> If you want to file a JIRA issue, that would be the best way to get it in a
> future release.
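
For illustration only, the brief retry suggested above could look roughly like the sketch below. It is not the attached patch and not Solr's actual code: `ReconnectRetry`, `retry`, and the delay values are hypothetical names and numbers chosen for this example. It assumes the connection attempt (for example, the ZooKeeper client construction seen in the stack trace) can be wrapped in a `Callable` that throws an `IOException` such as `UnknownHostException` on failure.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class ReconnectRetry {

    // Assumed upper bound on the wait between attempts.
    private static final long MAX_DELAY_MS = 30_000L;

    /**
     * Calls connect until it succeeds or maxAttempts is reached.
     * Only IOException (which includes UnknownHostException) is retried;
     * any other exception propagates immediately.
     */
    static <T> T retry(Callable<T> connect, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt, doubling the delay up to the cap.
                    Thread.sleep(delay);
                    delay = Math.min(delay * 2, MAX_DELAY_MS);
                }
            }
        }
        // Every attempt failed with an IOException; surface the last one.
        throw last;
    }
}
```

A caller would pass the reconnect step as the `Callable`, so a temporary DNS outage like the one described here gets a few more chances before the node gives up. Whether to retry indefinitely or only briefly is a design choice the issue leaves open.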