[
https://issues.apache.org/jira/browse/SOLR-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Per Steffensen updated SOLR-3561:
---------------------------------
Description:
Running several Solr servers in a SolrCloud cluster (zkHost set on the Solr servers).
Several collections with several slices and one replica for each slice (so each
slice has two shards).
Basically we want to let our system delete an entire collection. We do this by
deleting each and every shard under the collection, one by one, via
CoreAdmin UNLOAD requests against the relevant Solr server:
{code}
CoreAdminRequest request = new CoreAdminRequest();
request.setAction(CoreAdminAction.UNLOAD);
request.setCoreName(shardName);
CoreAdminResponse resp = request.process(new CommonsHttpSolrServer(solrUrl));
{code}
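For context, the full deletion loop looks roughly like the sketch below. The CollectionDeleter class and the coreNamesByUrl map are hypothetical names used for illustration only; in our system the mapping from Solr URL to core names is derived from the cluster state:
{code}
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

// Sketch only: unload every core of the collection, one by one, on the server hosting it.
public class CollectionDeleter {
  public static void deleteCollection(Map<String, List<String>> coreNamesByUrl) throws Exception {
    for (Map.Entry<String, List<String>> entry : coreNamesByUrl.entrySet()) {
      CommonsHttpSolrServer solr = new CommonsHttpSolrServer(entry.getKey());
      for (String coreName : entry.getValue()) {
        CoreAdminRequest request = new CoreAdminRequest();
        request.setAction(CoreAdminAction.UNLOAD); // CoreAdmin UNLOAD of a single core
        request.setCoreName(coreName);
        CoreAdminResponse resp = request.process(solr);
        // the unload itself reports success; the problems only show up in ZK afterwards
      }
    }
  }
}
{code}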
The delete/unload succeeds, but in roughly 10% of the cases we get errors on the
involved Solr servers, right around the time the shards/cores are deleted, and
we end up in a situation where ZK still claims (forever) that the deleted shard
is still present and active.
From here the issue is more easily explained with a concrete example:
- 7 Solr servers involved
- Several collections, among others one called "collection_2012_04", consisting
of 28 slices and 56 shards (remember: 1 replica for each slice), named
"collection_2012_04_sliceX_shardY" for all pairs in {X:1..28}x{Y:1,2} (the
naming scheme is spelled out in the sketch below the list)
- Each Solr server running 8 shards, e.g. Solr server #1 is running shard
"collection_2012_04_slice1_shard1" and Solr server #7 is running shard
"collection_2012_04_slice1_shard2", both belonging to the same slice "slice1".
When we decide to delete the collection "collection_2012_04" we go through all
56 shards and delete/unload them one by one - including
"collection_2012_04_slice1_shard1" and "collection_2012_04_slice1_shard2". At
some point during or shortly after all this deletion we see the following
exceptions in solr.log on Solr server #7:
{code}
Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover:org.apache.solr.common.SolrException: core not found:collection_2012_04_slice1_shard1
request: http://solr_server_1:8983/solr/admin/cores?action=PREPRECOVERY&core=collection_2012_04_slice1_shard1&nodeName=solr_server_7%3A8983_solr&coreNodeName=solr_server_7%3A8983_solr_collection_2012_04_slice1_shard2&state=recovering&checkLive=true&pauseFor=6000&wt=javabin&version=2
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.solr.common.SolrExceptionPropagationHelper.decodeFromMsg(SolrExceptionPropagationHelper.java:29)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:445)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:264)
at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:188)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:285)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:206)
Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
SEVERE: Recovery failed - trying again...
Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
WARNING:
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
{code}
I'm not sure exactly how to interpret this, but it seems to me that some
recovery job tries to recover collection_2012_04_slice1_shard2 on Solr server
#7 from collection_2012_04_slice1_shard1 on Solr server #1, but fails because
Solr server #1 answers back that it doesn't run collection_2012_04_slice1_shard1
(anymore).
This problem occurs for several (in this concrete test, 4) of the 28 slice
pairs. For those 4 shards the end result is that
/node_states/solr_server_X:8983_solr in ZK still contains information about the
shard being running and active. E.g. /node_states/solr_server_7:8983_solr still
contains:
{code}
{
"shard":"slice1",
"state":"active",
"core":"collection_2012_04_slice1_shard2",
"collection":"collection_2012_04",
"node_name":"solr_server_7:8983_solr",
"base_url":"http://solr_server_7:8983/solr"
}
{code}
and that CloudState therefore still reports that those shards are running and
active - but they are not. Among other things, I have noticed that
"collection_2012_04_slice1_shard2" HAS been removed from solr.xml on Solr
server #7 (we are running with persistent="true").
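The stale registration is easy to verify directly against ZK with the plain ZooKeeper client; the sketch below assumes a hypothetical connect string for our ensemble:
{code}
import org.apache.zookeeper.ZooKeeper;

// Sketch: dump the node_states entry for Solr server #7 straight from ZK.
// "zk_host:2181" is an assumption - substitute the real zkHost of the cluster.
public class NodeStateDump {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk_host:2181", 10000, null);
    byte[] data = zk.getData("/node_states/solr_server_7:8983_solr", false, null);
    System.out.println(new String(data, "UTF-8")); // still lists the unloaded core as "active"
    zk.close();
  }
}
{code}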
Any chance that this bug is fixed in a later revision (than the one from
29/2-2012) of 4.0-SNAPSHOT?
If not, we need to get it fixed, I believe.
> Error during deletion of shard/core
> -----------------------------------
>
> Key: SOLR-3561
> URL: https://issues.apache.org/jira/browse/SOLR-3561
> Project: Solr
> Issue Type: Bug
> Components: multicore, replication (java), SolrCloud
> Affects Versions: 4.0
> Environment: Solr trunk (4.0-SNAPSHOT) from 29/2-2012
> Reporter: Per Steffensen
>