[jira] [Commented] (SOLR-13038) Overseer actions fail with NoHttpResponseException following a node restart
[ https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708133#comment-16708133 ]

Jason Gerlowski commented on SOLR-13038:

I've attached a strawman patch that adds a very basic retry check into HttpShardHandler. Most of the patch is just plumbing to pass around a "retryable" boolean to where it can be added to {{ShardRequest}}. This plumbing is pretty rough - I wouldn't commit it without finding something a little more elegant - but it's sufficient for showing the change conceptually.

Having seen a lot of discussion on prior JIRAs related to this issue, it seems like there's a lot of concern about retrying on this particular error case. To summarize, {{NoHttpResponseException}} is ambiguous - there's no way to tell whether the server received and processed your request or not. So a requirement is that we avoid retrying any non-idempotent requests. That was the main goal in choosing the approach I did for this strawman patch. Each caller of HttpShardHandler can choose whether they're OK with their request being retried, with the default being to not retry.

Anyway, curious if people have any thoughts.

Oh, one last thing. Also in this patch is an additional assertion to LeaderTragicEventTest that exhibits the problem. It passes with the rest of the patch, but will fail and show the problem when applied on its own.

> Overseer actions fail with NoHttpResponseException following a node restart
> ---
>
> Key: SOLR-13038
> URL: https://issues.apache.org/jira/browse/SOLR-13038
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Assignee: Jason Gerlowski
> Priority: Major
> Attachments: SOLR-13038.patch
>
>
> I noticed recently that a lot of overseer operations fail if they're executed right after a restart of a Solr node.
> The failure returns a message like "org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: https://127.0.0.1:62253/solr". The logs are a bit more helpful:
> {code}
> org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://127.0.0.1:62253/solr
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[java/:?]
>     at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[java/:?]
>     at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172) ~[java/:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_172]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[java/:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
> Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to respond
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.10.jar:4.4.10]
>     at
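
The opt-in retry idea from the comment above (retry only when the caller marks the request as safe to repeat, and only on the ambiguous "no response" failure) can be sketched roughly as follows. This is not the attached patch: `RetrySketch`, `ShardCall`, `submit`, `maxAttempts`, and the stand-in `NoResponseException` are all hypothetical names chosen for illustration, and the stand-in exception only mimics {{NoHttpResponseException}} so that the sketch needs no HttpClient dependency.

```java
import java.io.IOException;

class RetrySketch {

    /** Stand-in for org.apache.http.NoHttpResponseException (assumption: used here only to avoid a dependency). */
    static class NoResponseException extends IOException {
        NoResponseException(String message) {
            super(message);
        }
    }

    /** Hypothetical unit of work; a real caller would send the shard request here. */
    interface ShardCall<T> {
        T execute() throws IOException;
    }

    /**
     * Runs the call, retrying only when the caller has opted in and the failure
     * is the ambiguous "no response" case. Any other IOException propagates at once.
     */
    static <T> T submit(ShardCall<T> call, boolean retryable, int maxAttempts) throws IOException {
        // Callers that did not opt in get exactly one attempt, matching the "do not retry" default.
        int allowed = retryable ? Math.max(1, maxAttempts) : 1;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.execute();
            } catch (NoResponseException e) {
                if (attempt >= allowed) {
                    throw e; // out of attempts, or retries were never permitted
                }
                // Otherwise loop around and try the call again.
            }
        }
    }
}
```

Under these assumptions, an idempotent operation would call something like `RetrySketch.submit(call, true, 3)`, while a non-idempotent one would pass `false` and see the exception on the first failure. Keeping the decision with the caller reflects the concern that the server may already have processed a request that appeared to get no response.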
[jira] [Commented] (SOLR-13038) Overseer actions fail with NoHttpResponseException following a node restart
[ https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707814#comment-16707814 ]

Jason Gerlowski commented on SOLR-13038:

You can reproduce this behavior pretty regularly with the JUnit test below that uses SolrCloudTestCase as its base:

{code}
@Test
public void testOtherReplicasAreNotActive() throws Exception {
  final String collection = "collection1";
  CollectionAdminRequest
      .createCollection(collection, "config", 1, 2)
      .process(cluster.getSolrClient());
  cluster.waitForActiveCollection(collection, 1, 2);

  // Stop the non-leader replica's node and wait until the replica is no longer ACTIVE.
  Slice shard = getCollectionState(collection).getSlice("shard1");
  JettySolrRunner otherReplicaJetty = cluster.getReplicaJetty(getNonLeader(shard));
  otherReplicaJetty.stop();
  cluster.waitForJettyToStop(otherReplicaJetty);
  waitForState("Timeout waiting for replica get down", collection,
      (liveNodes, collectionState) ->
          getNonLeader(collectionState.getSlice("shard1")).getState() != Replica.State.ACTIVE);

  // Restart the node and wait for the replica to become ACTIVE again.
  otherReplicaJetty.start();
  cluster.waitForNode(otherReplicaJetty, 30);
  waitForState("Timeout waiting for replica get up", collection,
      (liveNodes, collectionState) ->
          getNonLeader(collectionState.getSlice("shard1")).getState() == Replica.State.ACTIVE);

  // An overseer action issued right after the restart is where the failure shows up.
  CollectionAdminResponse response =
      CollectionAdminRequest.deleteCollection(collection).process(cluster.getSolrClient());
  assertNull("Expected collection-delete to fully succeed", response.getResponse().get("failure"));
}
{code}