There are 11 collections, each with only one shard, and each node has 10
replicas (9 collections are on every node, 2 are on just one node).
We're not seeing any OOM errors on restart.

I think we're being patient waiting for the leader election to occur.
We stopped the troublesome "leader that is not the leader" server
about 15-20 minutes ago, but we still have not had a leader election.

Cheers

Tom

On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> How many replicas per Solr JVM? And do you
> see any OOM errors when you bounce a server?
> And how patient are you being, because it can
> take 3 minutes for a leaderless shard to decide
> it needs to elect a leader.
>
> See SOLR-7280 and SOLR-7191 for the case
> where lots of replicas are in the same JVM,
> the tell-tale symptom is errors in the log as you
> bring Solr up saying something like
> "OutOfMemory error.... unable to create native thread"
>
> SOLR-7280 has patches for 6x and 7x, with a 5x one
> being added momentarily.
>
> Best,
> Erick
>
> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>> of the collections on it marked as "Recovering" or "Recovery Failed".
>> It attempts to recover from the leader, but the leader responds with:
>>
>> Error while trying to recover.
>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:30000/solr: We are not the leader
>>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:30000/solr: We are not the leader
>>   at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>>   at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>>   at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>>   ... 5 more
>>
>> and recovery never occurs.
>>
>> Each collection in this state has plenty (10+) of active replicas, but
>> stopping the server that is marked as the leader doesn't trigger a
>> leader election amongst these replicas.
>>
>> REBALANCELEADERS did nothing.
>> FORCELEADER complains that there is already a leader.
>> FORCELEADER with the purported leader stopped took 45 seconds,
>> reported status of "0" (and no other message) and kept the down node
>> as the leader (!)
>> Deleting the failed collection from the failed node and re-adding it
>> has the same "Leader said I'm not the leader" error message.
>>
>> Any other ideas?
>>
>> Cheers
>>
>> Tom
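
For anyone following this thread with the same symptom, here is a rough sketch (not something either poster ran) of how one might list shards whose recorded leader looks wrong, using the Collections API CLUSTERSTATUS action. The function names, the reason strings, and the example base URL are made up for illustration, and the JSON layout (cluster / collections / shards / replicas, with "leader" given as the string "true") is assumed from what CLUSTERSTATUS typically returns, so check it against your own output first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict. base_url is e.g. 'http://host:30000/solr'."""
    query = urlencode({"action": "CLUSTERSTATUS", "wt": "json"})
    with urlopen(f"{base_url}/admin/collections?{query}") as resp:
        return json.load(resp)


def suspect_shards(status):
    """Return (collection, shard, reason) tuples for shards whose leader looks wrong.

    Assumes the response has cluster.live_nodes and
    cluster.collections.<name>.shards.<shard>.replicas.<core_node>.
    """
    cluster = status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    problems = []
    for coll_name, coll in cluster.get("collections", {}).items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            # The leader flag is normally the string "true"; tolerate a boolean too.
            leaders = [r for r in replicas
                       if str(r.get("leader", "")).lower() == "true"]
            if not leaders:
                problems.append((coll_name, shard_name, "no leader"))
            for replica in leaders:
                node = replica.get("node_name")
                if node not in live_nodes:
                    # Matches the case above: a stopped node still recorded as leader.
                    problems.append(
                        (coll_name, shard_name, f"leader on non-live node {node}"))
                elif replica.get("state") != "active":
                    problems.append(
                        (coll_name, shard_name, f"leader state is {replica.get('state')}"))
    return problems
```

Calling suspect_shards(fetch_cluster_status("http://172.31.1.171:30000/solr")) against any live node should then show which shards still point at the stopped server. It only reads cluster state; it does not attempt FORCELEADER or change anything.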