Jan Høydahl created SOLR-18174:
----------------------------------
Summary: AsyncTracker Semaphore leak on LBAsyncSolrClient retries
Key: SOLR-18174
URL: https://issues.apache.org/jira/browse/SOLR-18174
Project: Solr
Issue Type: Bug
Components: SolrJ
Reporter: Jan Høydahl
Assignee: Jan Høydahl
We experienced completely deadlocked distributed requests on Solr 9.10.1 several
times in production, roughly once every couple of days. A Solr restart resolved
the issue. This started happening immediately after upgrading from Solr 9.7 to 9.10.
I had Claude make an analysis of what could be happening, see
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] .
The analysis identifies several code changes related to distributed search
between those versions, involving SOLR-17819, SOLR-17792 and SOLR-17776. These
changed the behavior of cancelAll and request.abort during aborted or failed
queries, which could lead to a semaphore leak, at least temporarily (for 10 min).
Later we were able to catch an internal test environment in the failure state
and took thread dumps of the two nodes in the cluster (attached).
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient
has retry logic for when the first request fails. It spawns a new request
which obtains another Semaphore permit without first releasing the permit
obtained for the original query. The net result is that the original permit is
leaked. A description of this failure scenario will be presented in a Pull
Request, which also shows a reproduction and a fix.
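
To illustrate the failure mode described above, here is a minimal sketch using a
plain java.util.concurrent.Semaphore. It is not Solr's actual code: the class
name, method names and permit count are hypothetical, and the real AsyncTracker
and load-balancing client logic may differ in detail. It only shows how a retry
that acquires a second permit without releasing the first leaves one permit
permanently unavailable.

```java
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; none of these names come from Solr.
class PermitLeakSketch {
    static final int MAX_PERMITS = 4; // assumed limit, chosen for the example

    // Acquire one permit without blocking; fail loudly if none is available.
    static void acquireOrFail(Semaphore permits) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("no permit available");
        }
    }

    // Leaky pattern: the retry takes a second permit while the permit of the
    // failed original request is never released.
    static int availableAfterLeakyRetry() {
        Semaphore permits = new Semaphore(MAX_PERMITS);
        acquireOrFail(permits);   // original request takes a permit
        // ... original request fails here, its permit is NOT released ...
        acquireOrFail(permits);   // retry takes another permit
        permits.release();        // only the retry's completion releases
        return permits.availablePermits();
    }

    // Fixed pattern: release the original permit before the retry acquires.
    static int availableAfterFixedRetry() {
        Semaphore permits = new Semaphore(MAX_PERMITS);
        acquireOrFail(permits);   // original request takes a permit
        permits.release();        // original request failed: release first
        acquireOrFail(permits);   // retry takes a permit
        permits.release();        // retry completes and releases
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + availableAfterLeakyRetry()); // leaky: 3
        System.out.println("fixed: " + availableAfterFixedRetry()); // fixed: 4
    }
}
```

In this sketch each leaky retry costs one permit, so after MAX_PERMITS such
retries every later acquire would block, which is consistent with the deadlock
symptom reported above.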