[
https://issues.apache.org/jira/browse/SOLR-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-16992:
--------------------------------------
Attachment: OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt
thetaphi_solr_Solr-main-Linux_14679.log.txt
Status: Open (was: Open)
I'm attaching two files: a typical jenkins job failure of {{StreamingTest}}
({_}before{_} SOLR-16983) and the non-reproducable failure i was able to
locally coax out of branch_9x ({_}after{_} SOLR-16983)
Here's the important snippet from the output of the local failure...
{noformat}
> java.lang.AssertionError: ObjectTracker found 1 object(s) that were
not released!!! [Input]
>
org.eclipse.jetty.client.util.InputStreamResponseListener$Input:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException:
org.eclipse.jetty.client.u
til.InputStreamResponseListener$Input
> at
org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:52)
> at
org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener$ObjectReleaseTrackedInputStream.<init>(Http2SolrClient.java:
1495)
> at
org.apache.solr.client.solrj.impl.Http2SolrClient$InputStreamReleaseTrackingResponseListener.getInputStream(Http2SolrClient.java:1489)
> at
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:532)
> at
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1192)
> at
org.apache.solr.client.solrj.io.stream.SolrStream.constructParser(SolrStream.java:304)
> at
org.apache.solr.client.solrj.io.stream.SolrStream.open(SolrStream.java:129)
> at
org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:539)
> at
org.apache.solr.client.solrj.io.stream.CloudSolrStream$StreamOpener.call(CloudSolrStream.java:527)
> at
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289)
> at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.Exception: Submitter stack trace
> at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256)
> at
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
> at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.lambda$openStreams$8(CloudSolrStream.java:419)
> at
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
> at
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> at
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:420)
> at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:292)
> at
org.apache.solr.client.solrj.io.stream.ExceptionStream.open(ExceptionStream.java:52)
> at
org.apache.solr.client.solrj.io.stream.StreamingTest.getTuple(StreamingTest.java:2895)
> at
org.apache.solr.client.solrj.io.stream.StreamingTest.testExceptionStream(StreamingTest.java:662)
{noformat}
Here is roughly what i think is happening...
* {{testExceptionStream()}} calls {{getTuple(estream)}} calls
{{ExceptionStream.open()}} calls {{CloudSolrStream.open()}}
* {{CloudSolrStream.open()}} initializes {{List<TupleStream> solrStreams}}
containing one {{SolrStream}} per shard and calls
{{CloudSolrStream.openStreams()}}
* {{CloudSolrStream.openStreams()}} ...
** spins up an {{ExecutorService servce =
ExecutorUtil.newMDCAwareCachedThreadPool(new
SolrNamedThreadFactory("CloudSolrStream"))}}
** calls {{service.submit(new StreamOpener(...))}} on every instance in
{{solrStreams}} and records the {{List<Future<TupleWrapper>> futures}}
** loops over the {{futures}} ...
*** If any throw an (Execution) exception: immediately wrap and rethrow that
exception
** call {{servce.shutdown()}} in a {{finally}} block
*** *This is not a blocking call!*
*** It does not terminate (or wait for completion) of any {{StreamOpener}}
instances currently being executed by ThreadPool threads.
* {{ExceptionStream.open()}} will catch the exception thrown by
{{CloudSolrStream.open()}} and store it in a Tuple
* {{getTuple(estream)}} gets the Tuple from {{ExceptionStream.read()}} and
calls {{ExceptionStream.close()}} calls {{CloudStream.close()}}
* At this point {{getTuple(estream)}} returns the (expected) Tuple from
{{ExceptionStream}} and the test can make assertions on them and pass
** *BUT MEANWHILE* ...
** _after_ the test thread has (cascadingly) called
{{{}CloudStream.close(){}}}, threads from the {{ExecutorService}} may still be
calling {{SolrStream.open()}} on the instances in {{CloudStream.solrStreams}}
*** Due to the way the {{SolrClientCache}} works, These {{SolrStream.open()}}
calls can/will cause new {{Http2SolrClient}} instances to be created
**** These {{Http2SolrClient}} instances may be opened _after_ the test thread
has called {{SolrClientCache.close()}} ... causing them to leak
*** These {{SolrStream.open()}} calls can/will cause new HTTP Connections to
be initiated, opening new {{InputStream}} instances
**** These {{InputStream}} instances will never be closed, because the
{{SolrStream}} instances will never be "re"-closed.
----
I think at a minimum, {{CloudSolrStream.openStreams()}} should be changed such
that...
* It waits for the results of _every_ {{{}Future{}}}, even if one of them
throws an exception
** The {{IOException}} it ultimate throws should wrap one of these caught
exceptions, and call {{addSuppressed}} on all the rest
*** This seems like a cleaner/better behavior for the exception handling in
{{{}CloudSolrStream.openStreams(){}}}, independent of the concurrency race
conditions being discussed
*** A key benefit of this, is that it ensures all the {{StreamOpener}}
instances will have completed before {{CloudSolrStream.openStreams()}} returns
– ensuring that no background threads can/will muck with the internals of
{{CloudSolrStream}} while the caller has moved on to calling
{{CloudSolrStream.close()}}
* it might make sense to instead use {{shutdownNow() + awaitTermination()}}
if/when any {{Future}} fails
** but that would definitely be more complicated to implement (let alone
implement correctly)
----
Going above and beyond that, it's worth considering:
* update many {{Closeable}} classes in solr to keep track of if/when {{close}}
is called more then once
*
** ie: a new {{private transient boolean isClosed = false;}} that {{close()}}
sets to {{true}}
* In the specific case of streaming classes discussed in this jira...
** Change {{SolrClientCache}} so that any method that will add to
{{solrClients}} throws an {{IllegalStateException}} if {{isClosed}}
** make all {{TupleStream}} instances throw an {{IllegalStateException}} from
{{open()}} and/or {{read()}} if {{isClosed}}
* In the general case, maybe the {{close()}} method of many solr {{Closeable}}
classes should...
** log a "WARN" if {{isClosed}} is already true
** {{assert ! isClosed : "already closed";}} so we trip tests if this happens
** (Neither of these are strictly idempotent, but I think the end results
would be better then perfect idempotent-ness)
/cc [~stillalex]
> Non-reproducible StreamingTest failures -- suggests CloudSolrStream
> concurency race condition
> ---------------------------------------------------------------------------------------------
>
> Key: SOLR-16992
> URL: https://issues.apache.org/jira/browse/SOLR-16992
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments:
> OUTPUT-org.apache.solr.client.solrj.io.stream.StreamingTest.txt,
> thetaphi_solr_Solr-main-Linux_14679.log.txt
>
>
> Roughly 3% of all jenkins jobs that run {{StreamingTest}} wind up having
> suite level failures.
> These failures have historically taken the form of
> {{com.carrotsearch.randomizedtesting.ThreadLeakError}} and the leaked threads
> all have names like
> {{"h2sc-718-thread-2"}} indicating that they come from the internal
> {{ExecutorService}} of an {{{}Http2SolrClient{}}}.
> In my experience, the seeds from these failures have never reproduced -
> suggesting that the problem is related to concurrency.
> SOLR-16983 restored the (correct) use of {{ObjectReleaseTracker}} which in
> theory should help pinpoint where {{Http2SolrClient}} instances might not be
> getting closed (by causing {{ObjectReleaseTracker}} to fail with stacktraces
> of when/where any unclosed instances were created - ie: which test method)
> In practice, I have managed to force one failure from {{StreamingTest}} since
> the SOLR-16983 changes (logs to be attached soon) - but it still didn't
> indicate any leaked/unclosed {{Http2SolrClient}} instances. What it instead
> indicated was a _single_ unclosed {{InputStream}} instance related to
> {{Http2SolrClient}} connections (SOLR-16983 also added better tracking of
> this) coming from {{StreamingTest.testExceptionStream}} - a test method that
> opens _five_ very similar {{ExceptionStream}} instances, wrapping
> {{CloudSolrStream}} instance, which expect to trigger server side errors.
> By it's very design, {{ExceptionStream}} catches & records any exceptions
> from the stream it wraps, so even in the event of these "expected" server
> side errors, {{ExceptionStream.close()}} should still be correctly getting
> called (and propagating down to the {{CloudStream}} it wraps).
> I believe the underlying problem has to do with a concurrency race condition
> between the call to {{CloudStream.close()}} and the {{ExecutorService}} used
> internally by {{CloudSolrStream.openStreams()}} (details to follow)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]