We have scripts that use the Solr replica management APIs. The scripts pass
the async parameter and then poll until the request has finished.
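
A minimal sketch of this kind of polling, for readers who want to reproduce
it. This is not our actual script: the function names, the base URL layout,
and the timeout values are assumptions made for illustration. It relies only
on the Collections API REQUESTSTATUS action and the state values it reports.
The timeout is there so that a request that never completes is reported
rather than polled forever.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_status(base_url, request_id):
    """Ask Solr for the state of an async request.

    base_url is assumed to look like "http://myHost:8984/solr".
    """
    query = urllib.parse.urlencode(
        {"action": "REQUESTSTATUS", "requestid": request_id, "wt": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/admin/collections?{query}") as resp:
        body = json.load(resp)
    return body["status"]["state"]


def wait_for_async(request_id, fetch, timeout_s=600.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(request_id) until a terminal state or until timeout_s elapses.

    Returns the terminal state ("completed", "failed", "notfound"),
    or "timeout" if the request is still pending at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch(request_id)
        if state in ("completed", "failed", "notfound"):
            return state
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)
```

Hypothetical usage:
wait_for_async("my-request-id", lambda rid: fetch_status("http://myHost:8984/solr", rid))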

Fairly regularly the DELETEREPLICA action will *never* finish.

I eventually enabled enough logging to see that it is spinning on this:

> INFO (parallelCoreAdminExecutor-19-thread-4-processing-n:myHost:8984_solr x:my_colleciton_shard105_0_replica_n2695 OFYOHGJY3554330096761208 UNLOAD) [   ] o.a.s.c.SolrCore Core my_colleciton_shard105_0_replica_n2695 is not yet closed, waiting 100 ms before checking again.

We have left this for tens of MINUTES (I see a recent example in our logs
of this spinning for 25 minutes) without it progressing on its own. When we
notice this, we have to restart the Solr process, which seems to correct the
state for practical purposes, and then move on. This manual intervention is
very painful.

The log statement appears to come from the closeAndWait method of the
SolrCore class
<https://github.com/apache/solr/blob/33b74e65caf46062737bbc6bc3507a39b1049f67/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1536-L1539>
(called by the unload method). It has a while loop that checks `isClosed`,
and `isClosed` just checks whether the references are 0.
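
To make the suspected failure mode concrete, here is a toy model of that
loop. It is not Solr's code: the class, the method names, and the max_checks
bound are all hypothetical, added only so the sketch terminates. It shows
that a single holder that never releases its reference keeps the wait loop
spinning indefinitely.

```python
import time


class RefCountedCore:
    """Toy, single-threaded model of a reference-counted core (not SolrCore)."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 1  # the owner holds the initial reference

    def open(self):
        """Hand out one more reference to a caller."""
        self.ref_count += 1
        return self

    def close(self):
        """Release one reference."""
        self.ref_count -= 1

    def is_closed(self):
        """Closed only once every reference has been released."""
        return self.ref_count <= 0

    def close_and_wait(self, max_checks=None, sleep=time.sleep):
        """Release our reference, then poll until all other holders release theirs.

        max_checks is a hypothetical bound for this sketch; with None the loop
        has no exit other than the count reaching zero.
        Returns True if the core closed, False if the bound was hit.
        """
        self.close()
        checks = 0
        while not self.is_closed():
            if max_checks is not None and checks >= max_checks:
                return False
            checks += 1
            sleep(0.1)  # analogous to the 100 ms wait in the log message
        return True


# A caller that opened the core and never closed it leaves the count at 1.
leaky = RefCountedCore("example_core")
leaky.open()
still_open = not leaky.close_and_wait(max_checks=3, sleep=lambda s: None)
```

In this model the wait ends only when the leaked reference is released,
which matches the behaviour we see until the process is restarted.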

So the question is what could cause references to not go to zero for such a
long period of time? Any way to get visibility on what references are
remaining? Is this a known or documented issue anywhere?

Thanks
