[
https://issues.apache.org/jira/browse/SOLR-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022607#comment-18022607
]
Chris M. Hostetter commented on SOLR-3696:
------------------------------------------
For reasons I don't understand, my test reports show that
{{TestReplicaProperties}} has been failing in ~50% of all jenkins builds over
the past 7 days – but only on the "main" branch (since it's a suite level test
failure, which can record more failures than runs, historic metrics aren't
available because they are unreliable – so no idea how long it's been failing).
Based on a sampling of the jenkins log, the problem is always leaked
{{aliveCheckExecutor}} threads.
----
These seeds don't *reliably* reproduce for me, but they do _occasionally_
reproduce – and since neither jenkins nor any of my local failures included any
{{ObjectReleaseTracker}} failures regarding the {{LBSolrClient}} instance, that
strongly suggests that the {{LBSolrClient}} *IS* being properly closed, but
{{LBSolrClient}} is _*not*_ (reliably) closing the {{aliveCheckExecutor}}.
Reviewing the lifecycle of the {{aliveCheckExecutor}} I had a hunch, and with a
small amount of instrumentation I was able to confirm what the problem is:
Nothing prevents a thread from executing a request via {{LBSolrClient}},
which may then "fail" and cause {{LBSolrClient.startAliveCheckExecutor}} to be
called, *AFTER* {{LBSolrClient.close()}} has been (concurrently) called by some
other thread.
{noformat}
// LBSolrClient.aliveCheckExecutor is initially null
T1: sends a request via the LBSolrClient, which fails and enters
LBSolrClient.startAliveCheckExecutor
T2: enters LBSolrClient.close, synchronizes on LBSolrClient.this
T2: sees that aliveCheckExecutor is null, does nothing, returns (releasing
synchronization lock on LBSolrClient.this)
T1: sees that aliveCheckExecutor is null, synchronizes on LBSolrClient.this
T1: double checks that aliveCheckExecutor is still null, assigns it a new
Executor, returns (releasing synchronization lock on LBSolrClient.this)
{noformat}
We need to make {{LBSolrClient.startAliveCheckExecutor()}} smart enough to not
initialize the Executor if {{LBSolrClient.close()}} has already been (or is
concurrently being) called.
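As a rough sketch of one way that could look: the class below is a hypothetical stand-in, *not* the real {{LBSolrClient}} code. The {{closed}} flag, the {{hasExecutor()}} helper, the boolean return value, and the choice of {{ScheduledExecutorService}} are all assumptions made for illustration. The idea is that {{close()}} records that it has run while holding the same lock the lazy initializer uses, so the T1/T2 interleaving above can no longer create an Executor after close.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

/** Hypothetical stand-in for LBSolrClient; illustrates the guard only. */
class GuardedClient implements AutoCloseable {
  // Lazily created; volatile so the unsynchronized null check sees updates.
  private volatile ScheduledExecutorService aliveCheckExecutor;
  // Assumed flag, guarded by synchronized (this).
  private boolean closed;

  /** Returns false (and creates nothing) if close() has already run. */
  boolean startAliveCheckExecutor() {
    if (aliveCheckExecutor == null) {
      synchronized (this) {
        if (closed) {
          return false; // too late: never create an Executor after close()
        }
        if (aliveCheckExecutor == null) {
          aliveCheckExecutor = Executors.newSingleThreadScheduledExecutor();
        }
      }
    }
    return true;
  }

  @Override
  public void close() {
    synchronized (this) {
      closed = true; // set under the same lock the initializer takes
      if (aliveCheckExecutor != null) {
        aliveCheckExecutor.shutdownNow();
        aliveCheckExecutor = null;
      }
    }
  }

  /** Test helper (hypothetical): is an Executor currently held? */
  synchronized boolean hasExecutor() {
    return aliveCheckExecutor != null;
  }
}
```

With this shape, whichever of T1 or T2 takes the lock first decides the outcome: either the Executor exists when {{close()}} runs and gets shut down, or {{close()}} runs first and the later start attempt is refused.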
> LBHttpSolrServer's aliveCheckExecutor is not closed in RecoveryZkTest (and
> possibly other tests)
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-3696
> URL: https://issues.apache.org/jira/browse/SOLR-3696
> Project: Solr
> Issue Type: Bug
> Reporter: Dawid Weiss
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> LBHttpSolrServer is never shut down properly and leaks pool threads from
> aliveCheckExecutor.