[ 
https://issues.apache.org/jira/browse/SOLR-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160156#comment-14160156
 ] 

Shalin Shekhar Mangar edited comment on SOLR-6591 at 10/6/14 11:37 AM:
-----------------------------------------------------------------------

What happens here is:
# The Stress Collection Creation thread in that test is trying to create 
collections (which have stateFormat=2).
# The overseer gets a "state" message from a new core created using the core 
admin API. This should implicitly create a new collection:
{code}
   [junit4]   2> 561673 T45931 oasc.Overseer$ClusterStateUpdater.updateState Update state numShards=1 message={
   [junit4]   2>          "collection":"halfcollectionblocker",
   [junit4]   2>          "base_url":"http://127.0.0.1:42021",
   [junit4]   2>          "state":"down",
   [junit4]   2>          "numShards":"1",
   [junit4]   2>          "node_name":"127.0.0.1:42021_",
   [junit4]   2>          "roles":null,
   [junit4]   2>          "shard":null,
   [junit4]   2>          "operation":"state",
   [junit4]   2>          "core":"halfcollection_shard1_replica1"}
   [junit4]   2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection Create collection halfcollectionblocker with shards [shard1]
   [junit4]   2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection state version halfcollectionblocker 1
   [junit4]   2> 561679 T45931 oasc.Overseer$ClusterStateUpdater.updateState Assigning new node to shard shard=shard1
{code}
# Right after the above message, the overseer gets a message to create 
'awholynewstresscollection_collection4_1' (I'm assuming through a "state" 
message). This fails with the following message:
{code}
   [junit4]   2> 561682 T45931 oasc.Overseer$ClusterStateUpdater.run ERROR Exception in Overseer main queue loop org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/awholynewstresscollection_collection4_1/state.json
   [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   [junit4]   2>        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:382)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:379)
   [junit4]   2>        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:379)
   [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.updateZkStates(Overseer.java:358)
   [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:311)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2>
{code}
# This exception causes the "state" message already executed for the 
'halfcollectionblocker' collection to be lost. The message is still present in 
the work queue, but because the overseer is healthy, it continues to execute 
the main queue instead. The core then times out waiting for its shard id:
{code}
   [junit4]   2> 881993 T46259 oasc.ZkController.waitForShardId waiting to find shard id in clusterstate for halfcollection_shard1_replica1
   [junit4]   2> 1202711 T46259 oasc.CoreContainer.create ERROR Error creating core [halfcollection_shard1_replica1]: Could not get shard id for core: halfcollection_shard1_replica1 org.apache.solr.common.SolrException: Could not get shard id for core: halfcollection_shard1_replica1
   [junit4]   2>        at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1425)
   [junit4]   2>        at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1371)
   [junit4]   2>        at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1513)
   [junit4]   2>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:504)
   [junit4]   2>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:484)
   [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
{code}
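
To make the sequence above concrete, here is a minimal toy model of the failure mode. It is a sketch only and not the Overseer code: the class LostUpdateSketch, the methods flush and runMainLoop, the batch size of two, and the use of plain in-memory collections in place of ZooKeeper are all assumptions made for illustration. It assumes that updates are applied in memory first, written out in batches, and that an exception during the write is logged and swallowed by the loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the lost update; every name here is hypothetical, not Solr's API. */
public class LostUpdateSketch {

    // Stand-in for the ZooKeeper write: fails if the batch touches 'failing'.
    static void flush(Set<String> persisted, Set<String> pending, String failing) {
        if (pending.contains(failing)) {
            throw new IllegalStateException(
                "NoNode for /collections/" + failing + "/state.json");
        }
        persisted.addAll(pending);
    }

    // Drains mainQueue in batches of two and returns what ended up persisted.
    static Set<String> runMainLoop(Deque<String> mainQueue,
                                   Deque<String> workQueue,
                                   String failing) {
        Set<String> persisted = new LinkedHashSet<>();
        while (!mainQueue.isEmpty()) {
            // Each pass starts again from the persisted state.
            Set<String> pending = new LinkedHashSet<>(persisted);
            List<String> batch = new ArrayList<>();
            try {
                while (batch.size() < 2 && !mainQueue.isEmpty()) {
                    String collection = mainQueue.poll();
                    workQueue.add(collection); // kept for replay, which this loop never does
                    batch.add(collection);
                    pending.add(collection);   // applied in memory only
                }
                flush(persisted, pending, failing);
                workQueue.removeAll(batch);    // batch is durable now
            } catch (IllegalStateException e) {
                // Logged and swallowed: 'pending' is dropped and the loop moves on.
                System.out.println("Exception in main queue loop: " + e.getMessage());
            }
        }
        return persisted;
    }

    public static void main(String[] args) {
        Deque<String> mainQueue = new ArrayDeque<>(Arrays.asList(
            "halfcollectionblocker", "awholynewstresscollection_collection4_1"));
        Deque<String> workQueue = new ArrayDeque<>();
        Set<String> persisted = runMainLoop(
            mainQueue, workQueue, "awholynewstresscollection_collection4_1");
        System.out.println("persisted = " + persisted);
        System.out.println("workQueue = " + workQueue);
    }
}
```

In this model the update for 'halfcollectionblocker' is never persisted even though its message was consumed from the main queue and still sits in the work queue, which matches the symptom in the logs. Whether the fix should replay the work queue on such an exception or isolate the failing write is a separate design question that the sketch does not try to answer.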



> Cluster state updates can be lost on exception in main queue loop
> -----------------------------------------------------------------
>
>                 Key: SOLR-6591
>                 URL: https://issues.apache.org/jira/browse/SOLR-6591
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: Trunk
>            Reporter: Shalin Shekhar Mangar
>             Fix For: Trunk
>
>
> I found this bug while going through this failure on Jenkins:
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/648/
> {code}
> 2 tests failed.
> REGRESSION:  org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testDistribSearch
> Error Message:
> Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> Stack Trace:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
>         at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testErrorHandling(CollectionsAPIDistributedZkTest.java:583)
>         at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.doTest(CollectionsAPIDistributedZkTest.java:205)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
