When this has happened to me before, I have had pretty good luck restarting the Overseer leader, which can be found in ZooKeeper under /overseer_elect/leader.
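For example, something like the following should show which node currently holds the Overseer role (the ZooKeeper and Solr host names here are just placeholders):

    # Read the election leader znode directly with the ZooKeeper CLI
    zkCli.sh -server zkhost:2181 get /overseer_elect/leader

    # Or ask any live Solr node via the Collections API
    curl "http://solrhost:8983/solr/admin/collections?action=OVERSEERSTATUS"

Restarting the Solr process named in that output is what I mean by restarting the Overseer leader.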
If that doesn't work I've had to do more intrusive and manual recovery methods, which suck.

On Tue, Jan 12, 2021 at 10:36 AM Pierre Salagnac <pierre.salag...@gmail.com> wrote:
>
> Hello,
> We had a stuck leader election for a shard.
>
> We have collections with 2 shards, and each shard has 5 replicas. We have many
> collections, but the issue happened for a single shard. Once all host restarts
> completed, this shard was stuck with one replica in "recovering" state and all
> the others in "down" state.
>
> Here is the state of the shard returned by the CLUSTERSTATUS command:
>   "replicas":{
>     "core_node3":{
>       "core":"...._shard1_replica_n1",
>       "base_url":"https://host1:8983/solr",
>       "node_name":"host1:8983_solr",
>       "state":"recovering",
>       "type":"NRT",
>       "force_set_state":"false"},
>     "core_node9":{
>       "core":"...._shard1_replica_n6",
>       "base_url":"https://host2:8983/solr",
>       "node_name":"host2:8983_solr",
>       "state":"down",
>       "type":"NRT",
>       "force_set_state":"false"},
>     "core_node26":{
>       "core":"...._shard1_replica_n25",
>       "base_url":"https://host3:8983/solr",
>       "node_name":"host3:8983_solr",
>       "state":"down",
>       "type":"NRT",
>       "force_set_state":"false"},
>     "core_node28":{
>       "core":"...._shard1_replica_n27",
>       "base_url":"https://host4:8983/solr",
>       "node_name":"host4:8983_solr",
>       "state":"down",
>       "type":"NRT",
>       "force_set_state":"false"},
>     "core_node34":{
>       "core":"...._shard1_replica_n33",
>       "base_url":"https://host5:8983/solr",
>       "node_name":"host5:8983_solr",
>       "state":"down",
>       "type":"NRT",
>       "force_set_state":"false"}}}
>
> The workaround was to shut down server host1, the one with the replica stuck in
> "recovering" state. This unblocked the leader election, and the 4 other replicas
> went active.
>
> Here is the first error I found in the logs related to this shard. It happened
> while shutting down server host3, which was the leader at that time:
>
> (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> consuming and closing http response stream. =>
> java.nio.channels.AsynchronousCloseException
>     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> java.nio.channels.AsynchronousCloseException: null
>     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
>     at java.io.InputStream.read(InputStream.java:205) ~[?:?]
>     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
>
> My understanding is that following this error, each server restart ended with the
> replica on that server being in "down" state, but I'm not sure how to confirm that.
> We then entered a loop where the term is increased because of failed
> replication.
>
> Is this a known issue? I found no similar ticket in Jira.
> Could you please help me get a better understanding of the issue?
> Thanks
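One way to confirm that failed-replication loop is to watch the per-shard term counters Solr keeps in ZooKeeper. A rough sketch (the collection name and ZooKeeper host are placeholders, since the real names are elided above):

    # Dump the shard terms znode; it holds a JSON map of core_node name -> term
    zkCli.sh -server zkhost:2181 get /collections/<collection>/terms/shard1

A replica whose term is lower than the highest value is considered out of date, so if those numbers keep climbing on every failed recovery attempt, that matches the loop you describe.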