[ https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cao Manh Dat updated SOLR-12066:
--------------------------------
    Attachment: SOLR-12066

> Autoscaling move replica can cause core initialization failure on the original JVM
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-12066
>                 URL: https://issues.apache.org/jira/browse/SOLR-12066
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: AutoScaling, SolrCloud
>            Reporter: Varun Thacker
>            Priority: Major
>             Fix For: 7.4, master (8.0)
>
>         Attachments: SOLR-12066
>
>
> Initially, when SOLR-12047 was created, it looked like waiting for a state in
> ZK for only 3 seconds was the culprit for cores not loading up.
>
> But it turns out to be something else. Here are the steps to reproduce this
> problem:
>
> - create a 3 node cluster
> - create a 1 shard x 2 replica collection to use node1 and node2 (
> [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true]
> )
> - stop node2: ./bin/solr stop -p 7574
> - Solr will create a new replica on node3 after 30 seconds because of the
> ".auto_add_replicas" trigger
> - At this point state.json has info about replicas being on node1 and node3
> - Start node2. Bam!
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
> ...
> Caused by: org.apache.solr.common.SolrException:
> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
> ...
> Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
> ...{code}
>
> The practical effect of this is small, since the move replica has already
> put the replica on another JVM. But to the user it is very confusing what is
> happening. They can never get rid of this error unless they manually clean up
> the data directory on node2 and restart.
>
> Please note: I chose autoAddReplicas=true to reproduce this, but a user could
> be using a node lost trigger and run into the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
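
The failure in the trace above comes down to a membership check: the core on node2 still carries coreNodeName core_node4, but state.json no longer lists that name under shard1. Below is a minimal Python sketch of that check, not Solr's actual implementation. The state.json layout is trimmed down and assumed for illustration, and the replica names core_node3, core_node6, replica_n1, replica_n5 and the node_name values are hypothetical placeholders; only core_node4, shard1 and test_node_lost come from the report.

```python
import json

# Hypothetical, trimmed-down state.json after the replica was moved to node3.
# The field layout and all replica/node names here are assumptions.
STATE_JSON = """
{
  "test_node_lost": {
    "shards": {
      "shard1": {
        "replicas": {
          "core_node3": {"core": "test_node_lost_shard1_replica_n1", "node_name": "node1"},
          "core_node6": {"core": "test_node_lost_shard1_replica_n5", "node_name": "node3"}
        }
      }
    }
  }
}
"""


def replica_registered(state, collection, shard, core_node_name):
    """Return True if core_node_name is listed under the given shard."""
    replicas = (
        state.get(collection, {})
        .get("shards", {})
        .get(shard, {})
        .get("replicas", {})
    )
    return core_node_name in replicas


state = json.loads(STATE_JSON)

# node2 restarts with its old on-disk core, which still claims core_node4.
if not replica_registered(state, "test_node_lost", "shard1", "core_node4"):
    print("coreNodeName core_node4 does not exist in shard shard1")
```

In this sketch the stale core on node2 fails the check on every restart, which matches the report that the error persists until the data directory on node2 is cleaned up manually.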