Frank J Kelly created SOLR-8868:
-----------------------------------
Summary: SolrCloud: if zookeeper loses and then regains a quorum
Solr and SolrJ Client still need to be restarted
Key: SOLR-8868
URL: https://issues.apache.org/jira/browse/SOLR-8868
Project: Solr
Issue Type: Bug
Components: SolrCloud, SolrJ
Affects Versions: 5.3.1
Reporter: Frank J Kelly
Tried mailing list on 3/15 and 3/16 to no avail. Hopefully I gave enough
details.
----
I am wondering whether the SolrCloud behavior I observe after ZooKeeper loses a
quorum is normal and to be expected.
Version of Solr: 5.3.1
Version of ZooKeeper: 3.4.7
Using SolrCloud with external ZooKeeper
Deployed on AWS
Our Solr cluster has 3 nodes (m3.large)
Our ZooKeeper ensemble consists of three nodes (t2.small), all with the same
config, which refers to the servers by DNS name, e.g.
{noformat}
$ more ../conf/zoo.cfg
tickTime=2000
dataDir=/var/zookeeper
dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
standaloneEnabled=false
server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
{noformat}
If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a
quorum is maintained.
Operation continues OK; we detect the terminated instance and launch a new
ZK node, which comes up fine.
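The quorum reasoning above can be checked with a little arithmetic. This is a minimal sketch, assuming the ensemble uses ZooKeeper's default majority quorums (the zoo.cfg shown configures no groups or weights); the class and method names are illustrative and not part of ZooKeeper.

```java
// Sketch only: majority-quorum arithmetic for an ensemble of voting servers.
// QuorumMath and its methods are hypothetical names used for illustration.
final class QuorumMath {

    /** Smallest number of live servers that forms a strict majority. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can be lost while a majority still remains. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    /** True if the live servers still form a majority of the ensemble. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // server.1 .. server.3 in the zoo.cfg above
        for (int live = ensembleSize; live >= 0; live--) {
            System.out.printf("live=%d quorum=%b%n", live, hasQuorum(ensembleSize, live));
        }
    }
}
```

For the three-node ensemble described here, a majority is two servers, so losing one node keeps the quorum and losing two does not, which matches the observations in this report.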
If we terminate two of the ZK nodes we lose the quorum, and we then observe the
following:
1.1) Admin UI shows an error that it is unable to contact ZooKeeper: "Could not
connect to ZooKeeper"
1.2) SolrJ returns the following:
{noformat}
org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
... 24 more
{noformat}
This makes sense based on our understanding.
When our Auto Scaling groups launch two new ZooKeeper nodes, initialize them,
fix the DNS, etc., we regain a quorum, but at this point:
2.1) Admin UI shows the shards as “GONE” (all greyed out)
2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now
bound to new IP addresses
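One possible contributor to 2.2, which nothing in this report confirms, is address caching: the JVM caches successful hostname lookups (governed by the `networkaddress.cache.ttl` security property), and a long-lived client may keep using addresses it resolved at startup. The sketch below only shows how one might check from the client host whether a name now resolves to different addresses than a recorded baseline. The class name `DnsRebindCheck`, the 30-second TTL, and the `localhost` default are illustrative assumptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: detect whether a DNS name has been rebound to new addresses.
// DnsRebindCheck is a hypothetical helper, not part of Solr or ZooKeeper.
final class DnsRebindCheck {

    /** Resolves a host name to the sorted set of its current IP addresses. */
    static Set<String> resolve(String host) throws UnknownHostException {
        Set<String> addresses = new TreeSet<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            addresses.add(address.getHostAddress());
        }
        return addresses;
    }

    /** True if the address set differs from the earlier baseline. */
    static boolean rebound(Set<String> baseline, Set<String> current) {
        return !baseline.equals(current);
    }

    public static void main(String[] args) throws UnknownHostException {
        // Assumption: cap the JVM's positive DNS cache at 30 seconds. This
        // must be set before the first lookup in the JVM to take effect.
        Security.setProperty("networkaddress.cache.ttl", "30");

        // Usage: java DnsRebindCheck <host> [baseline-ip ...]
        String host = args.length > 0 ? args[0] : "localhost";
        Set<String> baseline = new TreeSet<>();
        for (int i = 1; i < args.length; i++) {
            baseline.add(args[i]);
        }

        Set<String> current = resolve(host);
        System.out.println(host + " currently resolves to " + current);
        if (!baseline.isEmpty()) {
            System.out.println("rebound since baseline: " + rebound(baseline, current));
        }
    }
}
```

Even if the JVM returns fresh addresses, whether the ZooKeeper client inside Solr or SolrJ re-resolves its connect string after a connection loss is a separate question that this sketch does not answer.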
So at this point I restart the Solr nodes. After the restart:
3.1) Admin UI shows the collections as OK (all shards are green), so the
nodes are back!
3.2) SolrJ Client still shows the same error, namely
{noformat}
org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
.
.
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
{noformat}
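Since the client keeps failing until its process is restarted, one application-side workaround to consider is discarding and recreating the client object after repeated failures, rather than restarting the whole process. The following is a minimal, library-independent sketch of that pattern. `RebuildingClient`, `Op`, and the failure threshold are hypothetical names and values, and whether recreating a `CloudSolrClient` in-process actually recovers from this failure is an assumption that this report does not verify.

```java
import java.util.function.Supplier;

// Sketch only: wraps a closeable client and replaces it with a fresh instance
// after a configurable number of consecutive failed calls.
final class RebuildingClient<C extends AutoCloseable> {

    /** An operation against the client that may throw. */
    interface Op<T, R> {
        R apply(T client) throws Exception;
    }

    private final Supplier<C> factory;
    private final int maxConsecutiveFailures;
    private C client;
    private int failures;

    RebuildingClient(Supplier<C> factory, int maxConsecutiveFailures) {
        this.factory = factory;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.client = factory.get();
    }

    /** Runs op; after too many consecutive failures, swaps in a new client. */
    synchronized <R> R call(Op<C, R> op) throws Exception {
        try {
            R result = op.apply(client);
            failures = 0; // any success resets the failure streak
            return result;
        } catch (Exception e) {
            failures++;
            if (failures >= maxConsecutiveFailures) {
                try {
                    client.close();
                } catch (Exception ignored) {
                    // the old client is being discarded anyway
                }
                client = factory.get();
                failures = 0;
            }
            throw e; // the caller still sees the original failure
        }
    }
}
```

In a SolrJ application the factory would construct the Solr client from the same ZooKeeper connect string used today, and each `add` or `deleteById` call would go through `call`. The failed request is still reported to the caller; only subsequent requests use the rebuilt client.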
Is this lack of self-healing known and expected behavior?
If it is expected, should this issue be recast as an Improvement request?
Is this the same as, or similar to, the behavior documented in
https://issues.apache.org/jira/browse/SOLR-5129 ?
P.S. I can attach Solr log files if they would help.