[jira] [Commented] (SOLR-16992) Non-reproducible StreamingTest failures -- suggests CloudSolrStream concurency race condition

Chris M. Hostetter (Jira) Tue, 26 Sep 2023 15:09:19 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769354#comment-17769354
 ]


Chris M. Hostetter commented on SOLR-16992:
-------------------------------------------

{quote}...looked at all uses again and the large majority filter nulls, one 
would NPE and the shards ones have no nulls.
{quote}
That's fine ... if the only code paths you've changed that _don't_ previously 
filter nulls are impossible to produce nulls (or would NPE if a null was 
produced) that i'm fine with the idea of refactoring the null-filtering check 
into {{StreamExecutorHelper.submitAllAndAwaitAggregatingExceptions}} if you 
prefer that – as long as it's mentioned in the javadocs

(but i'm -1 to th null filtering being in 
{{{}ExecutorUtil.submitAllAndAwaitAggregatingExceptions{}}})

Only a few questions/concerns about your current diff...
 * I'm wondering if 
{{StreamExecutorHelper.submitAllAndAwaitAggregatingExceptions}} and 
{{ExecutorUtil.submitAllAndAwaitAggregatingExceptions}} should return 
{{Collection<T>}} instead of {{List<T>}} to help discourage people from making 
assumptions about the order of the results?
 ** Either that, or the javadocs should be explicit that the results are 
returned in the same order as the tasks
 *** Except if you do refactor the null filtering into 
{{StreamExecutorHelper.submitAllAndAwaitAggregatingExceptions}} it definitely 
shouldn't say that, and should return {{Collection}} instead of {{List}}
 * In {{{}ExecutorUtil.submitAllAndAwaitAggregatingExceptions{}}}...
 ** {{List<T> results = new ArrayList<>()}} can be init'ed using 
{{tasks.size()}} as a micro optimization to eliminate the possibility that the 
{{ArrayList}} will ever need to resize itself.
 ** let's add a comment to the 
{{tasks.stream().map(service::submit).collect(Collectors.toUnmodifiableList())}}
 call ...
 *** {{// Could alternatively use service.invokeAll, but this way we can start 
looping over futures before all are done}}
 * StreamExecutorHelperTest should extend SolrTestCase
 * Generally speaking, we avoid this pattern...
 ** 
{code:java}
try {
  codeThatThrowsException()
  fail("Expected exception");
} catch (TypeOfExpectedExeption e) {
  assertThat(somethingAbout(e))
}
{code}

 ** It tends to lead to developer confusion/mistakes when modifying tests
 ** instead the lucene & solr code base convention is...
{code:java}
TypeOfExpectedExeption e = expetThrows(TypeOfExpectedExeption.class, "Expected 
exception", () -> codeThatThrowsException());
assertThat(somethingAbout(e))
{code}

> Non-reproducible StreamingTest failures -- suggests CloudSolrStream 
> concurency race condition
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16992
>                 URL: https://issues.apache.org/jira/browse/SOLR-16992
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Assignee: Alex Deparvu
>            Priority: Major
>         Attachments: 
> OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt, 
> thetaphi_solr_Solr-main-Linux_14679.log.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Roughly 3% of all jenkins jobs that run {{StreamingTest}} wind up having 
> suite level failures.
> These failures have historically taken the form of 
> {{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads 
> all have names like
> {{"h2sc-718-thread-2"}} indicating that they come from the internal 
> {{ExecutorService}} of an {{{}Http2SolrClient{}}}.
> In my experience, the seeds from these failures have never reproduced - 
> suggesting that the problem is related to concurrency.
> SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}} which in 
> theory should help pinpoint where {{Http2SolrClient}} instances might not be 
> getting closed (by causing {{ObjectReleaseTracker}} to fail with stacktraces 
> of when/where any unclosed instances were created - ie: which test method)
> In practice, I have managed to force one failure from {{StreamingTest}} since 
> the SOLR-16983 changes (logs to be attached soon) - but it still didn't 
> indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead 
> indicated was a _single_ unclosed {{InputStream}} instance related to 
> {{Http2SolrClient}} connections (SOLR-16983 also added better tracking of 
> this) coming from {{StreamingTest.testExceptionStream}} - a test method that 
> opens _five_ very similar {{ExceptionStream}} instances, wrapping 
> {{CloudSolrStream}} instance, which expect to trigger server side errors.
> By it's very design, {{ExceptionStream}} catches & records any exceptions 
> from the stream it wraps, so even in the event of these "expected" server 
> side errors, {{ExceptionStream.close()}} should still be correctly getting 
> called (and propagating down to the {{CloudStream}} it wraps).
> I believe the underlying problem has to do with a concurrency race condition 
> between the call to {{CloudStream.close()}} and the {{ExecutorService}} used 
> internally by {{CloudSolrStream.openStreams()}} (details to follow)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-16992) Non-reproducible StreamingTest failures -- suggests CloudSolrStream concurency race condition

Reply via email to