Node not recovering, leader elections not occuring

Tom Evans Tue, 19 Jul 2016 07:42:25 -0700

Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
of the collections on it marked as "Recovering" or "Recovery Failed".
It attempts to recover from the leader, but the leader responds with:


Error while trying to recover.
core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:30000/solr: We are not the
leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:30000/solr: We are not the
leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
... 5 more

and recovery never occurs.

Each collection in this state has plenty (10+) of active replicas, but
stopping the server that is marked as the leader doesn't trigger a
leader election amongst these replicas.

REBALANCELEADERS did nothing.
FORCELEADER complains that there is already a leader.
FORCELEADER with the purported leader stopped took 45 seconds,
reported status of "0" (and no other message) and kept the down node
as the leader (!)
Deleting the failed collection from the failed node and re-adding it
has the same "Leader said I'm not the leader" error message.

Any other ideas?

Cheers

Tom

Node not recovering, leader elections not occuring

Reply via email to