Chris M. Hostetter created SOLR-16992:
-----------------------------------------

             Summary: Non-reproducible StreamingTest failures -- suggests 
CloudSolrStream concurrency race condition
                 Key: SOLR-16992
                 URL: https://issues.apache.org/jira/browse/SOLR-16992
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


Roughly 3% of all Jenkins jobs that run {{StreamingTest}} end up with 
suite-level failures.

These failures have historically taken the form of 
{{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads 
all have names like
{{"h2sc-718-thread-2"}}, indicating that they come from the internal 
{{ExecutorService}} of an {{Http2SolrClient}}.

In my experience, the seeds from these failures have never reproduced - 
suggesting that the problem is related to concurrency.

SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}}, which in 
theory should help pinpoint where {{Http2SolrClient}} instances might not be 
getting closed (by causing {{ObjectReleaseTracker}} to fail with stack traces 
showing when/where any unclosed instances were created, i.e. which test method).

In practice, I have managed to force one failure from {{StreamingTest}} since 
the SOLR-16983 changes (logs to be attached soon) - but it still didn't 
indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead 
indicated was a _single_ unclosed {{InputStream}} instance related to 
{{Http2SolrClient}} connections (SOLR-16983 also added better tracking of this) 
coming from {{StreamingTest.testExceptionStream}} - a test method that opens 
_five_ very similar {{ExceptionStream}} instances, each wrapping a 
{{CloudSolrStream}} instance that is expected to trigger server-side errors.

By its very design, {{ExceptionStream}} catches & records any exceptions from 
the stream it wraps, so even in the event of these "expected" server-side 
errors, {{ExceptionStream.close()}} should still be correctly getting called 
(and propagating down to the {{CloudSolrStream}} it wraps).

I believe the underlying problem has to do with a race condition between the 
call to {{CloudSolrStream.close()}} and the {{ExecutorService}} used 
internally by {{CloudSolrStream.openStreams()}} (details to follow).
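
To illustrate the general kind of race suspected here, below is a minimal, 
hypothetical sketch. It is not the actual {{CloudSolrStream}} code: 
{{FakeStream}}, {{FakeParentStream}}, {{openAsync()}} and the latches are 
invented for illustration, and the latches force an interleaving that in a 
real test run would only happen occasionally. The assumption is simply that 
{{close()}} closes whatever child streams have been registered so far, while a 
task on the internal {{ExecutorService}} may still register a new one 
afterwards.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CloseRaceSketch {
    // Counts resources that were opened but never closed.
    static final AtomicInteger unclosed = new AtomicInteger();

    // Hypothetical stand-in for a per-shard stream that holds an InputStream.
    static final class FakeStream implements AutoCloseable {
        FakeStream() { unclosed.incrementAndGet(); }
        @Override public void close() { unclosed.decrementAndGet(); }
    }

    // Hypothetical stand-in for a parent stream that opens children on an executor.
    static final class FakeParentStream implements AutoCloseable {
        final List<FakeStream> opened = new CopyOnWriteArrayList<>();
        final ExecutorService executor = Executors.newSingleThreadExecutor();
        final CountDownLatch closeCalled = new CountDownLatch(1);
        final CountDownLatch openerDone = new CountDownLatch(1);

        void openAsync() {
            executor.submit(() -> {
                try {
                    // Force the unlucky interleaving: open only after close() has run.
                    closeCalled.await();
                    opened.add(new FakeStream());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    openerDone.countDown();
                }
            });
        }

        @Override public void close() {
            // Closes only what is registered so far; does not wait for pending openers.
            for (FakeStream s : opened) {
                s.close();
            }
            executor.shutdown(); // already-submitted tasks are still allowed to finish
            closeCalled.countDown();
        }
    }

    // Returns the number of resources left unclosed after close() has returned.
    static int run() throws InterruptedException {
        unclosed.set(0);
        FakeParentStream parent = new FakeParentStream();
        parent.openAsync();
        parent.close();
        parent.openerDone.await(5, TimeUnit.SECONDS);
        return unclosed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unclosed resources: " + run());
    }
}
```

In this sketch {{close()}} returns normally, yet one resource is left open. 
That matches the shape of the observed symptom (a single unclosed 
{{InputStream}} with no unclosed client), though whether the real code path 
behaves this way is exactly what remains to be confirmed.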


