Varun Thacker created SOLR-12066:
------------------------------------

             Summary: Autoscaling move replica can cause core initialization failure on the original JVM
                 Key: SOLR-12066
                 URL: https://issues.apache.org/jira/browse/SOLR-12066
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Varun Thacker


Initially, when SOLR-12047 was created, it looked like waiting only 3 seconds 
for a state in ZK was the reason cores were not loading.

But it turns out to be something else. Here are the steps to reproduce the 
problem:

 - create a 3 node cluster
 - create a 1 shard x 2 replica collection on node1 and node2 
([http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true])
 - stop node2: ./bin/solr stop -p 7574
 - Solr will create a new replica on node3 after 30 seconds because of the 
".auto_add_replicas" trigger
 - At this point state.json has info about replicas being on node1 and node3
 - Start node2. Bam!
{code:java}
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
...
Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
...
Caused by: org.apache.solr.common.SolrException:
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
...
Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
...{code}
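
The last "Caused by" says that the coreNodeName of the core on the restarted node is no longer listed under the shard in state.json. A minimal Python sketch of that membership check follows. It illustrates the condition only and is not Solr's actual code. The function name replica_registered is hypothetical, and the sample state (replica names, core names, the third node's port) is made up for illustration; only the collection name, shard name and core_node4 come from the report.

```python
def replica_registered(state, collection, shard, core_node_name):
    """Return True if core_node_name is listed under the shard's replicas."""
    replicas = (
        state.get(collection, {})
        .get("shards", {})
        .get(shard, {})
        .get("replicas", {})
    )
    return core_node_name in replicas


# Hypothetical state.json contents after the trigger has replaced the replica
# that lived on node2. Replica names, core names and the 7575 port are made up.
sample_state = {
    "test_node_lost": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node3": {
                        "core": "test_node_lost_shard1_replica_n1",
                        "node_name": "localhost:8983_solr",
                        "state": "active",
                    },
                    "core_node6": {
                        "core": "test_node_lost_shard1_replica_n5",
                        "node_name": "localhost:7575_solr",
                        "state": "active",
                    },
                }
            }
        }
    }
}

# The core left on node2's disk still claims core_node4, which is gone.
print(replica_registered(sample_state, "test_node_lost", "shard1", "core_node4"))
```

With this sample state the check fails for core_node4, which matches the situation in the trace: the core on disk refers to a replica that the cluster state no longer knows about.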

 

The practical impact is small, since the move replica operation has already 
placed the replica on another JVM. But for the user it is very confusing what 
is happening. The error never goes away unless the user manually cleans up the 
data directory on node2 and restarts the node.
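
Before cleaning up by hand, it may help to find which core directories on the node refer to a replica that is no longer in the cluster state. The Python sketch below is one way to do that. It assumes each core directory holds a core.properties file with a coreNodeName entry, and that the set of currently registered coreNodeNames has been read from state.json separately. The helper names read_properties and find_stale_cores are hypothetical. The sketch only reports directories and deletes nothing.

```python
from pathlib import Path


def read_properties(path):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def find_stale_cores(solr_home, registered_names):
    """Return core directories whose coreNodeName is not in registered_names."""
    stale = []
    for props_file in sorted(Path(solr_home).rglob("core.properties")):
        name = read_properties(props_file).get("coreNodeName")
        if name is not None and name not in registered_names:
            stale.append(props_file.parent)
    return stale
```

A call such as find_stale_cores(solr_home, registered_names), with solr_home pointing at node2's Solr home and registered_names taken from the replicas listed in state.json, would return the leftover core directory. Whether to remove it remains a manual decision.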



