[ https://issues.apache.org/jira/browse/SOLR-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539182#comment-16539182 ]

Varun Thacker commented on SOLR-10720:
--------------------------------------

Okay, so the user was using HttpClusterStateProvider / 
CloudSolrClient#withSolrUrl, which is why creating collections would also 
fail when they hit this issue in Solr 7.2.1. 
{code:java}
Exception in thread "main" java.lang.RuntimeException: Couldn't initialize a HttpClusterStateProvider (is/are the Solr server(s), [http://host:port/solr/], down?)
    at org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:1496)
...
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://host:port/solr: Error loading config name for collection my_collection_name
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.fetchLiveNodes(HttpClusterStateProvider.java:189)
    at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.<init>(HttpClusterStateProvider.java:64)
    at org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:1494)
{code}

> Aggressive removal of a collection breaks cluster state
> -------------------------------------------------------
>
>                 Key: SOLR-10720
>                 URL: https://issues.apache.org/jira/browse/SOLR-10720
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.5.1
>            Reporter: Alexey Serba
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>             Fix For: 7.3, master (8.0)
>
>         Attachments: SOLR-10720.patch
>
>
> We are periodically seeing a tricky concurrency bug in SolrCloud that starts 
> with a `Could not fully remove collection: my_collection` exception:
> {noformat}
> 2017-05-17T14:47:50,153 - ERROR [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: my_collection operation: delete failed:org.apache.solr.common.SolrException: Could not fully remove collection: my_collection
>         at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
>         at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
>         at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)
> {noformat}
> After that, all SolrCloud operations that involve reading the cluster state 
> fail with:
> {noformat}
> org.apache.solr.common.SolrException: Error loading config name for collection my_collection
>     at org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
>     at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
> ...
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/my_collection
> ...
> {noformat}
> See full 
> [stacktraces|https://gist.github.com/serba/9b7932f005f34f6cd9a511e226c6f0c6]
> As a result, SolrCloud becomes completely broken. We are seeing this with 
> 6.5.1, but I think we’ve seen it with older versions too.
> From reading the code, it looks like a combination of two factors:
> * The collection's znode is forcefully removed in the finally block in 
> [DeleteCollectionCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L115],
>  which was introduced in SOLR-5135. Note that this leaves the cached cluster 
> state out of sync with the state in Zk, i.e. 
> {{zkStateReader.getClusterState()}} still has the collection in it (see the 
> code 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L98]),
>  whereas the {{/collections/<collection_id>}} znode in Zk is already removed.
> * The read-cluster-state operation does not only return the cached version; 
> it also reads each collection's config name from the 
> {{/collections/<collection_id>}} znode, which was forcefully removed. The 
> code that reads the config name for every collection directly from Zk was 
> introduced in SOLR-7636. Aren't there any performance implications of 
> reading N znodes (1 per collection) on every {{getClusterStatus}} call? 
> I'm not sure what the proper fix should be:
> * Should we just catch {{KeeperException$NoNodeException}} in 
> {{getClusterStatus}} and treat such a collection as removed? That looks like 
> the easiest / least invasive fix.
> * Should we stop reading the config name from the collection znode and get 
> it from the cache somehow?
> * Should we not try to delete the collection's data from Zk if the delete 
> operation failed?
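
For illustration, the first option in the list above (catch the NoNode error and treat the collection as removed) can be sketched with a toy model. This is not Solr code and not the attached patch: ClusterStatusSketch, its NoNodeException, the zk map and the cachedCollections set are all hypothetical stand-ins, assumed here only to show the stale-cache condition and the lenient read.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are Solr or ZooKeeper classes.
class ClusterStatusSketch {

    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Stand-in for Zk: znode path -> config name stored on that znode. */
    static final Map<String, String> zk = new HashMap<>();

    /** Stand-in for the cached cluster state: names of known collections. */
    static final Set<String> cachedCollections = new LinkedHashSet<>();

    /** Reads the config name from the collection znode, like a direct Zk read. */
    static String readConfigName(String collection) throws NoNodeException {
        String path = "/collections/" + collection;
        String configName = zk.get(path);
        if (configName == null) {
            throw new NoNodeException(path);
        }
        return configName;
    }

    /** Reported behaviour: one missing znode fails the whole status call. */
    static Map<String, String> getClusterStatusStrict() throws NoNodeException {
        Map<String, String> status = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            status.put(collection, readConfigName(collection));
        }
        return status;
    }

    /** Option 1: treat a collection whose znode is gone as already removed. */
    static Map<String, String> getClusterStatusLenient() {
        Map<String, String> status = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            try {
                status.put(collection, readConfigName(collection));
            } catch (NoNodeException e) {
                // Cache is stale: skip this collection instead of failing the call.
            }
        }
        return status;
    }

    public static void main(String[] args) {
        zk.put("/collections/healthy", "conf1");
        cachedCollections.add("healthy");
        // my_collection is still cached, but its znode was force-deleted.
        cachedCollections.add("my_collection");

        try {
            getClusterStatusStrict();
        } catch (NoNodeException e) {
            System.out.println("strict read failed: " + e.getMessage());
        }
        System.out.println("lenient read: " + getClusterStatusLenient());
    }
}
```

The lenient variant only hides the stale cache entry from readers; it does not address the other two questions (reading N znodes per call, or whether the znode should be deleted after a failed delete).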



