[ 
https://issues.apache.org/jira/browse/SOLR-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118293#comment-14118293
 ] 

Rich Mayfield commented on SOLR-5990:
-------------------------------------

Thanks [~shalinmangar], I'll have to give that a try when I get a chance.

> SolrCloud with LotsOfCores does not come up fully
> -------------------------------------------------
>
>                 Key: SOLR-5990
>                 URL: https://issues.apache.org/jira/browse/SOLR-5990
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>            Reporter: Rich Mayfield
>
> I have ~800 collections, each with a replica in two Solr nodes (so a total of 
> ~1600 cores).
> Many times when I restart both nodes they both come up and within a few  
> minutes all cores are available and there are no problems.
> Some times, though, I'll see the following:
> After a few minutes a bunch of cores are still in a "down" state.
> In the logs, I see errors indicating failures for the "down" cores like this:
> {noformat}
> 2014-04-16 20:43:18,902 [coreZkRegister-1-thread-260] INFO  
> org.apache.solr.update.PeerSync - PeerSync: core=instance_5691 
> url=http://localhost:19082/solr START 
> replicas=[http://localhost:19081/solr/instance_5691/] nUpdates=100
> 2014-04-16 20:43:18,902 [coreZkRegister-1-thread-34] INFO  
> org.apache.solr.update.PeerSync - PeerSync: core=instance_5217 
> url=http://localhost:19082/solr DONE.  We have no versions.  sync failed.
> {noformat}
> After approximately 10 minutes of nothing really happening, the nodes "woke 
> up" and a few collections reconciled. There are a lot of failures that 
> indicate a retry will occur - every retry waits double the amount of time as 
> the previous wait.
> {noformat}
> 2014-04-16 20:56:37,515 [http-19081-82] ERROR 
> org.apache.solr.core.ZkContainer - :org.apache.solr.common.SolrException: 
> Error getting leader from zk for shard shard1
>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:887)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:790)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:738)
>         at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:263)
>         at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:287)
>         at 
> org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:491)
>         at 
> org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:444)
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:847)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:309)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>         at 
> veeva.ecm.common.interfaces.web.SolrDispatchOverride.doFilter(SolrDispatchOverride.java:34)
>         at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>         at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>         at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>         at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>         at 
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:554)
>         at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>         at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
>         at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>         at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>         at java.lang.Thread.run(Thread.java:744) 
> Caused by: org.apache.solr.common.SolrException: Could not get leader props
>         at 
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:934)
>         at 
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:898)
>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:853)
>         ... 23 more
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /collections/instance_4345_CONTENT/leaders/shard1
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:271)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:271)
>         at 
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:912)
>         ... 25 more
> {noformat}
> Note that zkClientTimeout is set to 30000. Later I see lots of the following
> {noformat}
> 2014-04-16 21:25:22,861 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Recovery failed - trying again... 
> (10) core=instance_3007
> 2014-04-16 21:25:22,861 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy - Wait 600.0 seconds before trying to 
> recover again (11)
> 2014-04-16 21:25:24,222 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Error while trying to recover. 
> core=instance_4345_CONTENT:org.apache.solr.common.SolrException: No 
> registered leader was found after waiting for 4000ms , collection: 
> instance_4345_CONTENT slice: shard1
>         at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:531)
>         at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:514)
>         at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:345) 
>         at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)
> {noformat}
> Eventually both nodes appear to have replicas in a recovery_failed state. The 
> following is retried dozens of times:
> {noformat}
> 2014-04-16 21:28:17,337 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Error while trying to recover. 
> core=instance_4345_CONTENT:org.apache.solr.common.SolrException: No 
> registered leader was found after waiting for 4000ms , collection: 
> instance_4345_CONTENT slice: shard1
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:531)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:514)
>       at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:345)
>       at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)
> 2014-04-16 21:28:17,337 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Recovery failed - trying again... 
> (0) core=instance_4345_CONTENT
> 2014-04-16 21:28:17,337 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Recovery failed - interrupted. 
> core=instance_4345_CONTENT
> 2014-04-16 21:28:17,337 [RecoveryThread] ERROR 
> org.apache.solr.cloud.RecoveryStrategy - Recovery failed - I give up. 
> core=instance_4345_CONTENT
> 2014-04-16 21:28:17,337 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController - publishing core=instance_4345_CONTENT 
> state=recovery_failed collection=instance_4345_CONTENT
> 2014-04-16 21:28:17,337 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController - numShards not found on descriptor - 
> reading it from system property
> 2014-04-16 21:28:17,338 [RecoveryThread] WARN  
> org.apache.solr.cloud.RecoveryStrategy - Stopping recovery for 
> zkNodeName=localhost:19082_solr_instance_4345_CONTENTcore=instance_4345_CONTENT
> 2014-04-16 21:28:17,338 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy - Finished recovery process. 
> core=instance_4345_CONTENT
> 2014-04-16 21:28:17,339 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy - Starting recovery process.  
> core=instance_4345_CONTENT recoveringAfterStartup=false
> 2014-04-16 21:28:17,341 [main-EventThread] INFO  
> org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process 
> for shard shard1
> 2014-04-16 21:28:17,343 [main-EventThread] INFO  
> org.apache.solr.cloud.ShardLeaderElectionContext - Checking if I should try 
> and be the leader.
> 2014-04-16 21:28:17,344 [main-EventThread] INFO  
> org.apache.solr.cloud.ShardLeaderElectionContext - My last published State 
> was recovery_failed, I won't be the leader.
> 2014-04-16 21:28:17,344 [main-EventThread] INFO  
> org.apache.solr.cloud.ShardLeaderElectionContext - There may be a better 
> leader candidate than us - going back into recovery
> 2014-04-16 21:28:17,344 [main-EventThread] INFO  
> org.apache.solr.update.DefaultSolrCoreState - Running recovery - first 
> canceling any ongoing recovery
> 2014-04-16 21:28:17,344 [main-EventThread] WARN  
> org.apache.solr.cloud.RecoveryStrategy - Stopping recovery for 
> zkNodeName=localhost:19082_solr_instance_4345_CONTENTcore=instance_4345_CONTENT
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to