Jan Høydahl created SOLR-18174:
----------------------------------

             Summary: AsyncTracker Semaphore leak on LBAsyncSolrClient retries
                 Key: SOLR-18174
                 URL: https://issues.apache.org/jira/browse/SOLR-18174
             Project: Solr
          Issue Type: Bug
          Components: SolrJ
            Reporter: Jan Høydahl
            Assignee: Jan Høydahl


We experienced completely deadlocked distributed requests on Solr 9.10.1 
several times in production, roughly once every couple of days. A Solr restart 
resolved the issue. This started happening immediately after upgrading from 
Solr 9.7 to 9.10.

I had Claude analyze what could be happening, see 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] . 
The analysis identifies several code changes related to distributed search 
between those versions. It points to SOLR-17819, SOLR-17792 and SOLR-17776, 
which changed behavior around cancelAll and request.abort during aborted or 
failed queries and could lead to a semaphore leak, at least temporarily for 
10 min.

Later we caught an internal test environment in the failure state and were 
able to take thread dumps from the two nodes in the cluster (attached). 
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient 
has retry logic for when the first request fails. The retry spawns a new 
request which obtains another Semaphore permit without first releasing the 
permit obtained for the original request. The net result is that the original 
permit is leaked. A description of this failure scenario will be presented in 
a Pull Request which also shows a reproduction and a fix.
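
To make the retry failure mode concrete, here is a minimal sketch of the 
pattern using a plain java.util.concurrent.Semaphore. It is not the SolrJ 
code: the class PermitLeakSketch and the methods leakyRetry and fixedRetry are 
hypothetical names invented for illustration, and the actual fix in the Pull 
Request may look quite different. It only shows how a retry that acquires a 
new permit while the original one is still held ends up one permit short.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical illustration of the permit accounting; not Solr code. */
public class PermitLeakSketch {

    /**
     * First attempt fails, and the retry acquires a second permit while the
     * first is still held. Only one release ever happens.
     * Needs maxPermits >= 2; with a single permit the second acquire would
     * block forever, which is the deadlock in miniature.
     */
    static int leakyRetry(int maxPermits) {
        Semaphore permits = new Semaphore(maxPermits);
        permits.acquireUninterruptibly();     // permit for the original request
        boolean firstAttemptFailed = true;    // simulated failure
        if (firstAttemptFailed) {
            permits.acquireUninterruptibly(); // retry takes another permit
        }
        permits.release();                    // released once, when the retry completes
        return permits.availablePermits();
    }

    /** Same scenario, but each attempt releases its own permit before the next. */
    static int fixedRetry(int maxPermits) {
        Semaphore permits = new Semaphore(maxPermits);
        permits.acquireUninterruptibly();     // permit for the original request
        try {
            // simulated first attempt, which fails
        } finally {
            permits.release();                // give the permit back before retrying
        }
        permits.acquireUninterruptibly();     // permit for the retry
        try {
            // simulated retry, which succeeds
        } finally {
            permits.release();
        }
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakyRetry(4)); // prints "leaky: 3"
        System.out.println("fixed: " + fixedRetry(4)); // prints "fixed: 4"
    }
}
```

In this sketch every failed-then-retried request costs one permit for good, so 
after enough retries no permits remain and every new request blocks in 
acquire, which would match the observed hang until restart.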



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

