[ 
https://issues.apache.org/jira/browse/SOLR-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022607#comment-18022607
 ] 

Chris M. Hostetter commented on SOLR-3696:
------------------------------------------

For reasons I don't understand, my test reports show that 
{{TestReplicaProperties}} has been failing in ~50% of all jenkins builds over 
the past 7 days, but only on the "main" branch. (Since it's a suite-level test 
failure, which can record more failures than runs, historic metrics are 
unreliable and therefore unavailable, so I have no idea how long it's been 
failing.)

Based on a sampling of the jenkins log, the problem is always leaked 
{{aliveCheckExecutor}} threads.

 
----
 

These seeds don't *reliably* reproduce for me, but they do _occasionally_ 
reproduce – and since neither jenkins nor any of my local failures included any 
{{ObjectReleaseTracker}} failures regarding the {{LBSolrClient}} instance, that 
strongly suggests that the {{LBSolrClient}} *IS* being properly closed, but 
{{LBSolrClient}} is _*not*_ (reliably) closing the {{aliveCheckExecutor}}.

Reviewing the lifecycle of the {{aliveCheckExecutor}} I had a hunch, and with a 
small amount of instrumentation I was able to confirm what the problem is: 
Nothing prevents a thread from executing a request via {{LBSolrClient}}, 
which may then "fail" and cause {{LBSolrClient.startAliveCheckExecutor}} to be 
called, *AFTER* {{LBSolrClient.close()}} has been (concurrently) called by some 
other thread.
{noformat}
    // LBSolrClient.aliveCheckExecutor is initially null
T1: sends a request via the LBSolrClient, which fails and enters
    LBSolrClient.startAliveCheckExecutor
T2: enters LBSolrClient.close, synchronizes on LBSolrClient.this
T2: sees that aliveCheckExecutor is null, does nothing, returns (releasing
    synchronization lock on LBSolrClient.this)
T1: sees that aliveCheckExecutor is null, synchronizes on LBSolrClient.this
T1: double checks that aliveCheckExecutor is still null, assigns it a new
    Executor, returns (releasing synchronization lock on LBSolrClient.this)
{noformat}
We need to make {{LBSolrClient.startAliveCheckExecutor()}} smart enough to not 
initialize the Executor if {{LBSolrClient.close()}} has already been (or is 
concurrently being) called.
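
A minimal sketch of that kind of guard, assuming a {{closed}} flag protected by 
the same lock that {{close()}} already takes. {{GuardedLazyExecutor}}, 
{{closed}} and {{hasExecutor()}} are hypothetical names used for illustration 
only; this is not the actual {{LBSolrClient}} code, and for simplicity it 
synchronizes the whole method instead of keeping the unsynchronized first null 
check:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical stand-in for LBSolrClient (assumption, not the Solr source).
class GuardedLazyExecutor implements AutoCloseable {
  // Both fields are only read or written while holding the lock on "this".
  private ScheduledExecutorService aliveCheckExecutor; // initially null
  private boolean closed = false;

  /**
   * Lazily creates the executor unless close() has already run.
   * Returns true if an executor exists after the call, false if closed.
   */
  synchronized boolean startAliveCheckExecutor() {
    if (closed) {
      return false; // T1 arriving after T2's close(): do not create threads
    }
    if (aliveCheckExecutor == null) {
      aliveCheckExecutor = Executors.newSingleThreadScheduledExecutor();
    }
    return true;
  }

  /** Marks the client closed first, then shuts down any existing executor. */
  @Override
  public synchronized void close() {
    closed = true;
    if (aliveCheckExecutor != null) {
      aliveCheckExecutor.shutdownNow();
      aliveCheckExecutor = null;
    }
  }

  /** Test helper: reports whether an executor is currently held. */
  synchronized boolean hasExecutor() {
    return aliveCheckExecutor != null;
  }
}
```

With this shape, the T1/T2 interleaving above ends with T1 seeing 
{{closed == true}} under the lock and returning without creating an Executor, 
so nothing is left behind to leak.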

> LBHttpSolrServer's aliveCheckExecutor is not closed in RecoveryZkTest (and 
> possibly other tests)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3696
>                 URL: https://issues.apache.org/jira/browse/SOLR-3696
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> LBHttpSolrServer is never shut down properly and leaks pool threads from 
> aliveCheckExecutor.



