When this has happened to me before, I have had pretty good luck restarting
the overseer leader, which can be found in ZooKeeper under
/overseer_elect/leader
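
If it helps, a quick way to see which node currently holds the overseer role
is to read that znode directly. Here's a minimal sketch using the plain
ZooKeeper client (the connect string is just a placeholder for your own
ensemble, and if you run Solr under a ZK chroot the path sits below that
chroot):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ShowOverseerLeader {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point this at your own ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});
        try {
            // The znode payload is a small JSON document naming the current overseer node.
            byte[] data = zk.getData("/overseer_elect/leader", false, null);
            System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}

The Collections API OVERSEERSTATUS action also reports the current overseer,
if you'd rather not query ZooKeeper directly.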

If that doesn't work, I've had to resort to more intrusive and manual
recovery methods, which suck.

On Tue, Jan 12, 2021 at 10:36 AM Pierre Salagnac
<pierre.salag...@gmail.com> wrote:
>
> Hello,
> We had a stuck leader election for a shard.
>
> We have collections with 2 shards, and each shard has 5 replicas. We have
> many collections, but the issue happened for a single shard. Once all host
> restarts completed, this shard was stuck with one replica in "recovering"
> state and all the others in "down" state.
>
> Here is the state of the shard returned by the CLUSTERSTATUS command.
>       "replicas":{
>         "core_node3":{
>           "core":"...._shard1_replica_n1",
>           "base_url":"https://host1:8983/solr",
>           "node_name":"host1:8983_solr",
>           "state":"recovering",
>           "type":"NRT",
>           "force_set_state":"false"},
>         "core_node9":{
>           "core":"...._shard1_replica_n6",
>           "base_url":"https://host2:8983/solr",
>           "node_name":"host2:8983_solr",
>           "state":"down",
>           "type":"NRT",
>           "force_set_state":"false"},
>         "core_node26":{
>           "core":"...._shard1_replica_n25",
>           "base_url":"https://host3:8983/solr",
>           "node_name":"host3:8983_solr",
>           "state":"down",
>           "type":"NRT",
>           "force_set_state":"false"},
>         "core_node28":{
>           "core":"...._shard1_replica_n27",
>           "base_url":"https://host4:8983/solr",
>           "node_name":"host4:8983_solr",
>           "state":"down",
>           "type":"NRT",
>           "force_set_state":"false"},
>         "core_node34":{
>           "core":"...._shard1_replica_n33",
>           "base_url":"https://host5:8983/solr",
>           "node_name":"host5:8983_solr",
>           "state":"down",
>           "type":"NRT",
>           "force_set_state":"false"}}}
>
> The workaround was to shut down server host1, which held the replica stuck
> in recovering state. This unblocked the leader election, and the 4 other
> replicas went active.
>
> Here is the first error I found in the logs related to this shard. It
> happened while shutting down server host3, which was the leader at that time:
>  (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> consuming and closing http response stream. =>
> java.nio.channels.AsynchronousCloseException
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> java.nio.channels.AsynchronousCloseException: null
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
>
> My understanding is that, following this error, each server restart ended
> with the replica on that server being in "down" state, but I'm not sure how
> to confirm that.
> We then entered a loop where the term is increased because of failed
> replication.
>
> Is this a known issue? I found no similar ticket in Jira.
> Could you please help me get a better understanding of the issue?
> Thanks
