[
https://issues.apache.org/jira/browse/SOLR-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl updated SOLR-18174:
-------------------------------
Description:
Experienced completely deadlocked Solr 9.10.1 distributed requests several times
in production, once every couple of days. A Solr restart resolved the issue.
This started happening immediately after upgrading from Solr 9.7 to 9.10.
I had Claude make an analysis of what could be happening, see
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] .
This identifies several code changes related to distributed search between
those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related to
changed behavior with cancelAll and request.abort during aborted or failed
queries, which could lead to a semaphore leak, at least temporarily for 10 min.
While we could reproduce such a scenario, it would only be a temporary leak as
permits would be released after timeout.
Later we were able to catch an internal test environment in the failure state,
and were able to take thread dumps for the two nodes in the cluster (attached).
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient
has retry logic for when the first request fails, and it will spawn a new async
request which obtains another Semaphore permit, without first releasing the
permit obtained for the original query. The net result is that if the number of
available permits is already low, a permanent deadlock happens. I will attach a PR reproducing this
failure state, but it simulates a low number of permits as a prerequisite.
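To make the mechanism concrete, here is a minimal sketch of the permit pattern described above, using a plain java.util.concurrent.Semaphore. It is not Solr code: the class RetryPermitDemo and its methods are hypothetical names for illustration, and a timed tryAcquire is used only so the demo terminates (an untimed acquire() in the same position would block with nothing left to release the permit).

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and structure are assumptions, not Solr's implementation.
final class RetryPermitDemo {

    /** The retry asks for a second permit while the first attempt's permit is still held. */
    static boolean retryWhileHoldingPermit(int permits, long waitMillis) {
        Semaphore semaphore = new Semaphore(permits);
        if (!semaphore.tryAcquire()) {
            return false; // no permit even for the original request
        }
        try {
            // The first attempt "fails" here; the retry wants its own permit.
            return acquireAndRelease(semaphore, waitMillis);
        } finally {
            semaphore.release(); // original permit is returned only after the retry
        }
    }

    /** The failed attempt's permit is returned before the retry asks for one. */
    static boolean retryAfterReleasingPermit(int permits, long waitMillis) {
        Semaphore semaphore = new Semaphore(permits);
        if (!semaphore.tryAcquire()) {
            return false;
        }
        semaphore.release(); // give the failed attempt's permit back first
        return acquireAndRelease(semaphore, waitMillis);
    }

    /** Simulates the retry: take one permit within the timeout, then return it. */
    private static boolean acquireAndRelease(Semaphore semaphore, long waitMillis) {
        try {
            if (!semaphore.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                return false; // an untimed acquire() would block here indefinitely
            }
            semaphore.release();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // With a single permit left, the first variant cannot obtain a retry permit.
        System.out.println("retry while holding permit: " + retryWhileHoldingPermit(1, 200));
        System.out.println("retry after releasing permit: " + retryAfterReleasingPermit(1, 200));
    }
}
```

With one permit available, the first variant reports false and the second reports true. With two or more permits the first variant also succeeds, which matches the observation that the deadlock needs a low permit count as a prerequisite.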
So the final piece of the puzzle is to demonstrate how Semaphore permits may
gradually leak over time to get to a state of low availability, which is a
prerequisite for the deadlock case described above. This is still TBD.
was:
Experienced completely deadlocked Solr 9.10.1 distributed requests several times
in production, once every couple of days. A Solr restart resolved the issue.
This started happening immediately after upgrading from Solr 9.7 to 9.10.
I had Claude make an analysis of what could be happening, see
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977] .
This identifies several code changes related to distributed search between
those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related to
changed behavior with cancelAll and request.abort during aborted or failed
queries, which could lead to a semaphore leak, at least temporarily for 10 min.
Later we were able to catch an internal test environment in the failure state,
and were able to take thread dumps for the two nodes in the cluster (attached).
Analyzing these with Claude identified another failure mode: LBHttp2SolrClient
has retry logic for when the first request fails, and it will spawn a new request
which obtains another Semaphore permit, without first releasing the permit
obtained for the original query. Net result is that the original permit is
leaked. A description of this failure scenario will be presented in a Pull
Request which also shows reproduction and a fix.
> AsyncTracker Semaphore leak on LBAsyncSolrClient retries
> --------------------------------------------------------
>
> Key: SOLR-18174
> URL: https://issues.apache.org/jira/browse/SOLR-18174
> Project: Solr
> Issue Type: Bug
> Components: SolrJ
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Major
> Attachments: threads-test-node-0.json, threads-test-node-1.json
>
>
> Experienced completely deadlocked Solr 9.10.1 distributed requests several
> times in production, once every couple of days. A Solr restart resolved the
> issue. This started happening immediately after upgrading from Solr 9.7 to
> 9.10.
> I had Claude make an analysis of what could be happening, see
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977]
> . This identifies several code changes related to distributed search between
> those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related
> to changed behavior with cancelAll and request.abort during aborted or failed
> queries, which could lead to a semaphore leak, at least temporarily for 10
> min. While we could reproduce such a scenario, it would only be a temporary
> leak as permits would be released after timeout.
> Later we were able to catch an internal test environment in the failure
> state, and were able to take thread dumps for the two nodes in the cluster
> (attached). Analyzing these with Claude identified another failure mode:
> LBHttp2SolrClient has retry logic for when the first request fails, and it will
> spawn a new async request which obtains another Semaphore permit, without
> first releasing the permit obtained for the original query. The net result is
> that if the number of available permits is already low, a permanent deadlock
> happens. I will attach
> a PR reproducing this failure state, but it simulates a low number of permits
> as a prerequisite.
> So the final piece of the puzzle is to demonstrate how Semaphore permits may
> gradually leak over time to get to a state of low availability, which is a
> prerequisite for the deadlock case described above. This is still TBD.