Jason Gerlowski created SOLR-13038:
--------------------------------------

             Summary: Overseer actions fail with NoHttpResponseException 
following a node restart
                 Key: SOLR-13038
                 URL: https://issues.apache.org/jira/browse/SOLR-13038
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: master (8.0)
            Reporter: Jason Gerlowski
            Assignee: Jason Gerlowski


I noticed recently that a lot of overseer operations fail if they're executed 
right after a restart of a Solr node.  The failure returns a message like 
"org.apache.solr.client.solrj.SolrServerException:IOException occured when 
talking to server at: https://127.0.0.1:62253/solr";.  The logs are a bit more 
helpful:

{code}
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: https://127.0.0.1:62253/solr
    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657)
 ~[java/:?]
    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
 ~[java/:?]
    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
 ~[java/:?]
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) 
~[java/:?]
    at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172)
 ~[java/:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_172]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
    at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
 ~[metrics-core-3.2.6.jar:3.2.6]
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 ~[java/:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_172]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_172]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to 
respond
    at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
 ~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
    at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
    at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
    at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
    at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[java/:?]
    at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 ~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
 ~[java/:?]
    ... 12 more
{code}

After a bit of debugging I was able to confirm the problem: when some 
non-overseer node gets restarted, the overseer never notices that its 
connections are invalid and will try to reuse them for subsequent requests that 
happen right after the restart.

There's a few ways we might be able to tackle this:
* we could look at adding logic to {{SolrHttpRequestRetryHandler}} to retry 
when this happens.  SHRRH already retries NoHttpResponseException generally, 
but has other logic which prevents any retries on collection/core-admin APIs.  
Maybe we could elaborate this a bit.
* we could add retry logic to the {{HttpShardHandler}} code that makes these 
requests.  We could do this across the board, or more selectively for only the 
overseer commands that are "retry-able".
* We could tweak how our connection pool is managed so that it evicts these 
idle connections more aggressively.  It seems like something similar has 
already been tried (without success) on SOLR-6944

Not sure what the right approach is.  Seems like intermittent 
NoHttpResponseExceptions have been a problem in Solr (and its tests) going back 
at least 5 years or so.  Several JIRAs suggested adding retries for NHRE in the 
past but have been killed since not all APIs are idempotent and other JIRAs 
have been concerned with fixing this at the (very broad) SolrClient level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to