[jira] [Commented] (SOLR-13038) Overseer actions fail with NoHttpResponseException following a node restart

2018-12-03 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708133#comment-16708133
 ] 

Jason Gerlowski commented on SOLR-13038:


I've attached a strawman patch that adds a very basic retry check into 
HttpShardHandler.  Most of the patch is just plumbing to pass around a 
"retryable" boolean to where it can be added to {{ShardRequest}}.  This 
plumbing is pretty rough - I wouldn't commit it without finding something a 
little more elegant - but it's sufficient for showing the change conceptually.

Having seen a lot of discussion on prior JIRAs related to this issue, it seems 
like there's a lot of concern about retrying on this particular error case.  To 
summarize, {{NoHttpResponseException}} is ambiguous - there's no way to tell 
whether the server received and processed your request or not.  So a 
requirement is that we avoid retrying any non-idempotent requests.  That was 
the main goal in choosing the approach I did for this strawman patch.  Each 
caller of HttpShardHandler can choose whether they're OK with their request 
being retried, with the default being to not retry.
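
As a rough illustration of that opt-in idea (this is not the attached patch, just a minimal sketch): the class and method names below ({{RetrySketch}}, {{Request}}, {{submit}}) are hypothetical, and {{NoResponseException}} is a local stand-in for {{org.apache.http.NoHttpResponseException}} so the example has no dependencies.  The caller marks a request as retryable, and anything not marked is never retried.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of an opt-in retry; none of these names come from Solr.
final class RetrySketch {

    // Local stand-in for org.apache.http.NoHttpResponseException.
    static final class NoResponseException extends IOException {
        NoResponseException(String message) {
            super(message);
        }
    }

    // A request plus the caller's statement about whether retrying it is safe.
    static final class Request<T> {
        final Callable<T> call;
        final boolean retryable;

        // Default: not retryable, so non-idempotent requests are never re-sent.
        Request(Callable<T> call) {
            this(call, false);
        }

        Request(Callable<T> call, boolean retryable) {
            this.call = call;
            this.retryable = retryable;
        }
    }

    // Runs the request, retrying only on the ambiguous "no response" failure
    // and only when the caller opted in, at most maxRetries extra times.
    static <T> T submit(Request<T> request, int maxRetries) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return request.call.call();
            } catch (NoResponseException e) {
                // The server may or may not have processed the request.
                if (!request.retryable || retries >= maxRetries) {
                    throw e;
                }
                retries++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails on the first call only, like a stale connection after a restart.
        Callable<String> flaky = () -> {
            if (calls[0]++ == 0) {
                throw new NoResponseException("failed to respond");
            }
            return "ok";
        };
        System.out.println(submit(new Request<>(flaky, true), 1));
    }
}
```

In this sketch the one-argument {{Request}} constructor is what keeps the default conservative: a caller has to pass {{true}} explicitly before a request is ever sent a second time.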

Anyway, curious if people have any thoughts.

Oh, one last thing.  The patch also adds an assertion to 
{{LeaderTragicEventTest}} that exhibits the problem.  The assertion passes with 
the rest of the patch applied, but fails (showing the problem) when applied on 
its own.

> Overseer actions fail with NoHttpResponseException following a node restart
> ---
>
> Key: SOLR-13038
> URL: https://issues.apache.org/jira/browse/SOLR-13038
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Assignee: Jason Gerlowski
> Priority: Major
> Attachments: SOLR-13038.patch
>
>
> I noticed recently that a lot of overseer operations fail if they're executed 
> right after a restart of a Solr node.  The failure returns a message like 
> "org.apache.solr.client.solrj.SolrServerException:IOException occured when 
> talking to server at: https://127.0.0.1:62253/solr".  The logs are a bit more 
> helpful:
> {code}
> org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://127.0.0.1:62253/solr
> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657) ~[java/:?]
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[java/:?]
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[java/:?]
> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[java/:?]
> at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172) ~[java/:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_172]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[java/:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
> Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to respond
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.6.jar:4.5.6]
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.6.jar:4.5.6]
> at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.10.jar:4.4.10]
> at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.10.jar:4.4.10]
> at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[httpclient-4.5.6.jar:4.5.6]
> at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.10.jar:4.4.10]
> at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.10.jar:4.4.10]
> at 

[jira] [Commented] (SOLR-13038) Overseer actions fail with NoHttpResponseException following a node restart

2018-12-03 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707814#comment-16707814
 ] 

Jason Gerlowski commented on SOLR-13038:


You can reproduce this behavior pretty regularly with the JUnit test below that 
uses SolrCloudTestCase as its base:

{code}
  @Test
  public void testOtherReplicasAreNotActive() throws Exception {
    final String collection = "collection1";
    // Create a collection with one shard and two replicas.
    CollectionAdminRequest
        .createCollection(collection, "config", 1, 2)
        .process(cluster.getSolrClient());
    cluster.waitForActiveCollection(collection, 1, 2);

    // getNonLeader(Slice) is assumed to be a helper in the enclosing test class (not shown here).
    Slice shard = getCollectionState(collection).getSlice("shard1");
    JettySolrRunner otherReplicaJetty = cluster.getReplicaJetty(getNonLeader(shard));

    // Restart the node hosting the non-leader replica and wait for it to come back.
    otherReplicaJetty.stop();
    cluster.waitForJettyToStop(otherReplicaJetty);
    waitForState("Timeout waiting for replica get down", collection,
        (liveNodes, collectionState) ->
            getNonLeader(collectionState.getSlice("shard1")).getState() != Replica.State.ACTIVE);
    otherReplicaJetty.start();
    cluster.waitForNode(otherReplicaJetty, 30);
    waitForState("Timeout waiting for replica get up", collection,
        (liveNodes, collectionState) ->
            getNonLeader(collectionState.getSlice("shard1")).getState() == Replica.State.ACTIVE);

    // Issue an overseer operation right after the restart; this is where the failure shows up.
    CollectionAdminResponse response =
        CollectionAdminRequest.deleteCollection(collection).process(cluster.getSolrClient());
    assertNull("Expected collection-delete to fully succeed",
        response.getResponse().get("failure"));
  }
{code}
