[
https://issues.apache.org/jira/browse/SOLR-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506622#comment-17506622
]
Chris M. Hostetter commented on SOLR-16099:
-------------------------------------------
I'm filing this issue for tracking purposes and to raise visibility of the
symptoms and workaround - I don't know if/how it can be "fixed" in Solr.
This popped up for me when dealing with a customer who noticed they were
running out of disk on nodes hosting PULL replicas of a large collection that
was getting lots of document updates. The hung threads were holding
{{SolrIndexSearcher}} references, which had open file handles to segments that
were no longer in use by the "current" index, but whose disk space couldn't be
reclaimed by the filesystem because of the open file handles.
This problem could manifest in other ways for other people – the most likely
(beyond the obvious failed or "slow" request metrics) being high memory usage,
as these threads and the objects they hold references to take up heap space
indefinitely.
----
While investigating open jetty issues that might pertain to this, I discovered
that Ishan & Noble have apparently [been aware of this
problem|https://github.com/fullstorydev/lucene-solr/pull/121] [for a
while|https://github.com/fullstorydev/lucene-solr/pull/106], and already have
[an open ticket with the Jetty project
(#7192)|https://github.com/eclipse/jetty.project/issues/7192] regarding these
symptoms.
(Ishan alluded to this problem in SOLR-15840, but that Jira is a general "Task"
regarding questions about the performance of HTTP2 in Solr, that doesn't
explicitly call out that this "Bug" exists and can hang threads - hence I've
opened this new Jira.)
So far the only known [steps to reliably
reproduce|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-989727192]
are fairly onerous (they crashed my machine when I attempted to run them),
which makes it hard to identify a root cause, but the [latest analysis from
jetty dev|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-991068558]
{{sbordet}} is...
{quote}We were able to get a failure, so we instrumented the client to
understand why it was waiting, and found that it was waiting for a request
whose headers arrived fine (with a 200 status code) but the client was waiting
for the response body (no errors).
We tried to track this request on the server logs, but we could not find any
information.
Any attempt to increment the log level to DEBUG makes the failure
non-reproducible, so we are a bit stuck trying to understand what's going on.
From our point of view, the client received response headers and it's waiting
for the content, but it never arrives.
We cannot explain how the response headers arrived, but not the content, but we
don't have information about the server – whether there was an error on the
server or similar, because the logs don't show any error.
At this point I'm throwing the ball in your court: are you sure that you are
sending the response content?
...
If we can track the tcpdump and see that the HTTP/2 frames reach the client,
then it's a Jetty client problem;
otherwise the server did not send the HTTP/2 frames and this could be either
for a Jetty server problem, or because the application did not send any data.
In this way we're trying to restrict the problem.
{quote}
The issue seems to have stalled out ~2 months ago while attempting to provide
a usable {{tcpdump}} trace of the affected requests.
[~ichattopadhyaya] / [~noblepaul] : Do you have any more info on this that you
can share? If you can reliably reproduce this can you update the jetty ticket
w/usable {{tcpdump}} logs?
----
One thing we may want to consider (in Solr) is replacing our usage of
{{InputStreamResponseListener}} with a variant implementation that uses a
"timeout" instead of an unlimited {{wait()}} (along the lines of a [spin-off
jetty enhancement issue|https://github.com/eclipse/jetty.project/issues/7259]
one of the jetty devs filed). We could probably (with some effort) tweak the
impacted Solr APIs to propagate the (remaining) {{timeAllowed}} (if that option
was specified) down to this class – and/or use an "extreme" default (e.g.
30 minutes) just to prevent threads from sticking around forever.
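As a rough illustration of that idea, here is a minimal, hypothetical sketch (not Jetty's actual {{InputStreamResponseListener}} implementation; class and method names are invented for the example) of a monitor-based wait that gives up after a deadline instead of blocking forever the way an unbounded {{wait()}} can:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a bounded wait-for-content helper along the lines of
// the "timeout" variant discussed above. The real listener would also need
// to handle stream close, errors, and multiple chunks.
public class TimedResponseWait {
    private final Object lock = new Object();
    private boolean dataAvailable = false;

    // Called by the (internal) client thread when a chunk arrives off the wire.
    public void onContent() {
        synchronized (lock) {
            dataAvailable = true;
            lock.notifyAll();
        }
    }

    // Called by the request thread; waits at most timeoutMillis for data
    // instead of the unlimited wait() that can hang the thread forever.
    public boolean awaitContent(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!dataAvailable) {
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remaining <= 0) {
                    throw new TimeoutException(
                        "no response content within " + timeoutMillis + "ms");
                }
                lock.wait(remaining); // bounded wait, re-checked in a loop
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        TimedResponseWait w = new TimedResponseWait();
        try {
            w.awaitContent(100); // no producer ever calls onContent()
        } catch (TimeoutException expected) {
            System.out.println("timed out as expected");
        }
        w.onContent();
        System.out.println(w.awaitContent(100)); // data present: prints true
    }
}
```

The loop-and-recheck around {{wait(remaining)}} also makes the code robust to spurious wakeups, which an unbounded single {{wait()}} relies on never mattering.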
> HTTP Client threads can hang in Jetty's InputStreamResponseListener when
> using HTTP2 - impacts intra-node communication
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-16099
> URL: https://issues.apache.org/jira/browse/SOLR-16099
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud, SolrJ
> Reporter: Chris M. Hostetter
> Priority: Major
>
> There appears to be a Jetty HttpClient bug that makes it possible for a
> request thread to hang indefinitely while waiting to parse the response from
> remote jetty servers. The thread hangs because it calls {{wait()}} on a
> monitor lock that _should_ be notified by another (internal jetty client)
> thread when a chunk of data is available from the wire – but in some cases
> this evidently may not happen.
> In the case of {{distrib=true}} requests processed by Solr (aggregating
> multiple per-shard responses from other nodes) this can manifest with stack
> traces that look like the following (taken from Solr 8.8.2)...
> {noformat}
> "thread",{
>   "id":14253,
>   "name":"httpShardExecutor-7-thread-13819-...",
>   "state":"WAITING",
>   "lock":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>   "lock-waiting":{
>     "name":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>     "owner":null},
>   "synchronizers-locked":["java.util.concurrent.ThreadPoolExecutor$Worker@1ec1aed0"],
>   "cpuTime":"65.4882ms",
>   "userTime":"60.0000ms",
>   "stackTrace":[
>     "[email protected]/java.lang.Object.wait(Native Method)",
>     "[email protected]/java.lang.Object.wait(Unknown Source)",
>     "org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)",
>     "org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90)",
>     "org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99)",
>     "org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217)",
>     "org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211)",
>     "org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202)",
>     "org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)",
>     "org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:51)",
>     "org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:696)",
>     "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:412)",
>     "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)",
>     "org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:369)",
>     "org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:297)",
>     "org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:371)",
>     "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:132)",
>     "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:41)",
>     "[email protected]/java.util.concurrent.FutureTask.run(Unknown Source)",
>     "[email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)",
>     "[email protected]/java.util.concurrent.FutureTask.run(Unknown Source)",
>     "com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)",
>     "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)",
>     "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$175/0x0000000840243c40.run(Unknown Source)",
>     "[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)",
>     "[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)",
>     "[email protected]/java.lang.Thread.run(Unknown Source)"]},
> {noformat}
> ...these {{httpShardExecutor}} threads can stay hung, tying up system
> resources, indefinitely (unless they get a spurious {{notify()}} from the
> JVM). (In one case, it seems to have caused a request to hang for *10.3
> hours*.)
> Anecdotally:
> * There is some evidence that this problem did _*NOT*_ affect Solr 8.6.3,
> but does affect later versions
> ** suggesting the bug didn't exist in Jetty until _after_ 9.4.27.v20200227
> * Forcing the Jetty HttpClient to use HTTP1.1 transport seems to prevent
> this problem from happening
> ** In Solr this can be done by setting the {{"solr.http1"}} system property
> ** Or using the {{Http2SolrClient.Builder.useHttp1_1()}} method in client
> application code
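For completeness, the system-property half of that workaround amounts to the following (a minimal sketch; in a real deployment the property is normally passed as {{-Dsolr.http1=true}} on the JVM command line, before any of Solr's internal HTTP clients are constructed, rather than set programmatically):

```java
public class Http1Workaround {
    public static void main(String[] args) {
        // Force Solr's internal Jetty HttpClient onto HTTP/1.1 instead of
        // HTTP/2. This must take effect before the Http2SolrClient instances
        // are created; equivalent to starting the JVM with -Dsolr.http1=true.
        System.setProperty("solr.http1", "true");
        System.out.println(System.getProperty("solr.http1")); // prints: true
    }
}
```

Client applications that build their own {{Http2SolrClient}} would instead use the {{Http2SolrClient.Builder.useHttp1_1()}} method mentioned above (not shown here, as it requires the SolrJ dependency).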
--
This message was sent by Atlassian Jira
(v8.20.1#820001)