[jira] [Updated] (SOLR-18174) AsyncTracker Semaphore leak on LBAsyncSolrClient retries

Jira Tue, 24 Mar 2026 02:58:23 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jan Høydahl updated SOLR-18174:
-------------------------------
    Description: 
Experienced completely deadlocked Solr 9.10.1 distributed requests several
times in production, once every couple of days. A Solr restart resolved the
issue. This started happening immediately after upgrading from Solr 9.7 to 9.10.

I had Claude make an analysis of what could be happening, see 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] . 
This identifies several code changes related to distributed search between 
those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related to 
changed behavior with cancelAll and request.abort during aborted or failed 
queries, which could lead to a semaphore leak, at least temporarily for 10 min. 
While we could reproduce such a scenario, it would only be a temporary leak as 
permits would be released after timeout.

Later we were able to catch an internal test environment in the failure state, 
and were able to take thread dumps for the two nodes in the cluster (attached). 
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient 
has retry logic for when the first request fails, and it will spawn a new async 
request which obtains another Semaphore permit, without first releasing the 
permit obtained for the original query. Net result: if the number of available 
permits is already low, a permanent deadlock happens. I will attach a PR 
reproducing this failure state, but it simulates a low number of permits as a 
prerequisite.
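
A minimal sketch of the retry pattern described above, using a plain 
java.util.concurrent.Semaphore. This is not Solr's code: the class and method 
names (RetryPermitSketch, retryWhileHoldingPermit, retryAfterReleasingPermit) 
are hypothetical, and the single-permit setup is an assumption standing in for 
"available permits already low". A timed tryAcquire is used so the sketch 
returns instead of hanging; with an untimed acquire the first variant would 
block forever.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class RetryPermitSketch {

    /** Retry asks for a second permit while the original permit is still held. */
    public static boolean retryWhileHoldingPermit(Semaphore permits, long waitMillis) {
        try {
            permits.acquire(); // permit for the original request
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            // The original request has failed; the retry needs its own permit,
            // but the original permit has not been released yet.
            boolean gotRetryPermit = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (gotRetryPermit) {
                permits.release(); // retry finished
            }
            return gotRetryPermit;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            permits.release(); // original permit is returned only after the retry attempt
        }
    }

    /** Alternative ordering: return the original permit before the retry acquires one. */
    public static boolean retryAfterReleasingPermit(Semaphore permits, long waitMillis) {
        try {
            permits.acquire();  // permit for the original request
            permits.release();  // original request failed: give the permit back first
            boolean gotRetryPermit = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (gotRetryPermit) {
                permits.release(); // retry finished
            }
            return gotRetryPermit;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Semaphore permits = new Semaphore(1); // assumption: only one permit left
        System.out.println("retry while holding permit: " + retryWhileHoldingPermit(permits, 100));
        System.out.println("retry after releasing permit: " + retryAfterReleasingPermit(permits, 100));
    }
}
```

With one permit left, the first variant cannot obtain a permit for the retry 
because the only permit is held by the request being retried, while the second 
variant succeeds. Whether releasing before retrying is the right fix for 
LBHttp2SolrClient is for the PR to establish; the sketch only illustrates the 
ordering problem.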

So the final piece of the puzzle is to demonstrate how Semaphore permits may 
gradually leak over time to get to a state of low availability, which is a 
prerequisite for the deadlock case described above. This is still TBD.

  was:
Experienced completely deadlocked Solr 9.10.1 distributed requests several
times in production, once every couple of days. A Solr restart resolved the
issue. This started happening immediately after upgrading from Solr 9.7 to 9.10.

I had Claude make an analysis of what could be happening, see 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] . 
This identifies several code changes related to distributed search between 
those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related to 
changed behavior with cancelAll and request.abort during aborted or failed 
queries, which could lead to a semaphore leak, at least temporarily for 10 min.

Later we were able to catch an internal test environment in the failure state, 
and were able to take thread dumps for the two nodes in the cluster (attached). 
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient 
has retry logic for when the first request fails, and it will spawn a new 
request which obtains another Semaphore permit, without first releasing the 
permit obtained for the original query. Net result is that the original permit 
is leaked. A description of this failure scenario will be presented in a Pull 
Request which also shows reproduction and a fix.


> AsyncTracker Semaphore leak on LBAsyncSolrClient retries
> --------------------------------------------------------
>
>                 Key: SOLR-18174
>                 URL: https://issues.apache.org/jira/browse/SOLR-18174
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrJ
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Major
>         Attachments: threads-test-node-0.json, threads-test-node-1.json
>
>
> Experienced completely deadlocked Solr 9.10.1 distributed requests several 
> times in production, once every couple of days. A Solr restart resolved the 
> issue. This started happening immediately after upgrading from Solr 9.7 to 
> 9.10.
> I had Claude make an analysis of what could be happening, see 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] 
> . This identifies several code changes related to distributed search between 
> those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related 
> to changed behavior with cancelAll and request.abort during aborted or failed 
> queries, which could lead to a semaphore leak, at least temporarily for 10 
> min. While we could reproduce such a scenario, it would only be a temporary 
> leak as permits would be released after timeout.
> Later we were able to catch an internal test environment in the failure 
> state, and were able to take thread dumps for the two nodes in the cluster 
> (attached). Analyzing these with Claude identified another failure mode: 
> LBHttp2SolrClient has retry logic for when the first request fails, and it 
> will spawn a new async request which obtains another Semaphore permit, 
> without first releasing the permit obtained for the original query. Net 
> result: if the number of available permits is already low, a permanent 
> deadlock happens. I will attach a PR reproducing this failure state, but it 
> simulates a low number of permits as a prerequisite.
> So the final piece of the puzzle is to demonstrate how Semaphore permits may 
> gradually leak over time to get to a state of low availability, which is a 
> prerequisite for the deadlock case described above. This is still TBD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
