[
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416784#comment-16416784
]
Cao Manh Dat edited comment on SOLR-12066 at 3/28/18 4:36 AM:
--------------------------------------------------------------
Attached a patch for this ticket; it:
* removes the core's data
* adds a test
* makes the exception log less verbose (new format below)
{quote}
26283 ERROR (coreContainerWorkExecutor-42-thread-1-processing-n:127.0.0.1:52836_solr) [n:127.0.0.1:52836_solr ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on startup
org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node3 does not exist in shard shard1, ignore the exception if the replica was deleted
  at org.apache.solr.cloud.ZkController.checkStateInZk(ZkController.java:1739) ~[java/:?]
  at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1637) ~[java/:?]
  at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1037) ~[java/:?]
  at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:644) ~[java/:?]
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197) ~[metrics-core-3.2.2.jar:3.2.2]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_151]
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188) ~[java/:?]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
{quote}
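For context, the trace above points at ZkController.checkStateInZk: the core's coreNodeName is looked up in the shard's cluster state and, if missing, a NotInClusterStateException is raised. Below is only a minimal sketch of that kind of lookup, with illustrative names, not the actual patch code:
{code:java}
import org.apache.solr.common.SolrException;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ClusterStateCheckSketch {

  /** Illustrative stand-in for ZkController$NotInClusterStateException. */
  public static class NotInClusterStateException extends SolrException {
    public NotInClusterStateException(ErrorCode code, String msg) {
      super(code, msg);
    }
  }

  /**
   * Fails if the replica registered under coreNodeName is no longer present
   * in the given shard of the collection's cluster state.
   */
  public static void checkCoreNodeName(DocCollection coll, String shardId, String coreNodeName) {
    Slice slice = coll.getSlice(shardId);
    Replica replica = (slice == null) ? null : slice.getReplica(coreNodeName);
    if (replica == null) {
      throw new NotInClusterStateException(SolrException.ErrorCode.SERVER_ERROR,
          "coreNodeName " + coreNodeName + " does not exist in shard " + shardId
              + ", ignore the exception if the replica was deleted");
    }
  }
}
{code}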
was (Author: caomanhdat):
Attached a patch for this ticket; it:
* removes the core's data
* adds a test
* makes the exception log less verbose (new format below)
{quote}
26192 ERROR (coreContainerWorkExecutor-42-thread-1-processing-n:127.0.0.1:52489_solr) [n:127.0.0.1:52489_solr ] o.a.s.c.CoreContainer Error waiting for SolrCore to be created
java.util.concurrent.ExecutionException: org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node4 does not exist in shard shard1, ignore the exception if the replica was deleted
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_151]
  at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_151]
  at org.apache.solr.core.CoreContainer.lambda$load$14(CoreContainer.java:673) ~[java/:?]
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.2.jar:3.2.2]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_151]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_151]
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188) ~[java/:?]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node4 does not exist in shard shard1, ignore the exception if the replica was deleted
  at org.apache.solr.cloud.ZkController.checkStateInZk(ZkController.java:1739) ~[java/:?]
  at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1637) ~[java/:?]
  at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1037) ~[java/:?]
  at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:644) ~[java/:?]
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197) ~[metrics-core-3.2.2.jar:3.2.2]
  ... 5 more
{quote}
> Autoscaling move replica can cause core initialization failure on the
> original JVM
> ----------------------------------------------------------------------------------
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling, SolrCloud
> Reporter: Varun Thacker
> Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066.patch
>
>
> Initially, when SOLR-12047 was created, it looked like waiting only 3 seconds
> for a state in ZK was the culprit for cores not loading up.
>
> But it turns out to be something else. Here are the steps to reproduce this
> problem:
>
> - create a 3-node cluster
> - create a 1 shard x 2 replica collection on node1 and node2 (
> [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true]
> )
> - stop node2: ./bin/solr stop -p 7574
> - Solr will create a new replica on node3 after 30 seconds because of the
> ".auto_add_replicas" trigger
> - At this point state.json has info about the replicas being on node1 and node3
> - Start node2. Bam!
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
>   at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
> ...
> Caused by: org.apache.solr.common.SolrException:
>   at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
>   at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
> ...
> Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
> ...{code}
>
> The practical effect of this is not big, since the move-replica operation has
> already put the replica on another JVM. But to the user it's super confusing
> what's happening. They can never get rid of this error unless they manually
> clean up the data directory on node2 and restart.
>
> Please note: I chose autoAddReplicas=true to reproduce this, but a user could
> be using a node-lost trigger and run into the same issue.
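A note on the manual cleanup the description mentions: instead of deleting directories on disk, the stale core on node2 could likely be dropped via a CoreAdmin UNLOAD with data-dir deletion. A rough SolrJ sketch, assuming node2 listens on port 7574 and using the core name from the log above (both are assumptions for illustration):
{code:java}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadStaleCore {
  public static void main(String[] args) throws Exception {
    // Base URL and core name are assumptions for illustration only.
    try (SolrClient node2 = new HttpSolrClient.Builder("http://localhost:7574/solr").build()) {
      CoreAdminRequest.Unload unload = new CoreAdminRequest.Unload(true); // also delete the index
      unload.setCoreName("test_node_lost_shard1_replica_n2");
      unload.setDeleteDataDir(true);     // drop the orphaned data directory
      unload.setDeleteInstanceDir(true); // drop the whole core directory
      unload.process(node2);
    }
  }
}
{code}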
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]