[ https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-17764: -------------------------------------- Attachment: E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt Status: Open (was: Open) (Note all comments below relate to testing done against GIT SHA d0d71a7b38f87f3c70551678b2eab62d925483ef) I'm attaching an example OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt using seed=E7F93005B9386058 While the test "officially" fails at {{ChaosMonkeySafeLeaderWithPullReplicasTest.java:210}} this is just an assertion that none of the {{StoppableIndexingThread}} encountered any failures. If we search the test for the first instance of {{REQUEST FAILED}} logged by any {{StoppableIndexingThread}} we see the following: Note the port number of request URI in the error... {noformat} 2> 25374 ERROR (StoppableIndexingThread) [n: c: s: r: x: t:] o.a.s.c.StoppableIndexingThread REQUEST FAILED for id=0-14 2> => org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from server at http://127.0.0.1:33083/solr: Expected mime type in [application/octet-stream, application/vnd.apache.solr.javabin] but got text/html. <html> 2> <head> 2> org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from server at http://127.0.0.1:33083/solr: Expected mime type in [application/octet-stream, application/vnd.apache.solr.javabin] but got text/html. <html> 2> <head> 2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/> 2> <title>Error 503 Service Unavailable</title> 2> </head> 2> <body><h2>HTTP ERROR 503 Service Unavailable</h2> 2> <table> 2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr> 2> <tr><th>STATUS:</th><td>503</td></tr> 2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr> 2> <tr><th>SERVLET:</th><td>-</td></tr> 2> </table> 2> <hr/><a href="https://jetty.org">Powered by Jetty:// 10.0.22</a><hr/> 2> 2> </body> 2> </html> 2> 2> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:651) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:250) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:515) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:497) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:524) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:466) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1151) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:844) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:775) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:285) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:305) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]] 2> at org.apache.solr.cloud.StoppableIndexingThread.indexDocs(StoppableIndexingThread.java:193) ~[solr-test-framework-10.0.0-SNAPSHOT.jar:?] 2> at org.apache.solr.cloud.StoppableIndexingThread.run(StoppableIndexingThread.java:124) [solr-test-framework-10.0.0-SNAPSHOT.jar:?] {noformat} If we go backwards a little bit in the log file, looking specifically for port number {{33083}} we see... {noformat} 2> 25336 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.a.s.c.ChaosMonkey monkey: chose a victim! 33083 2> 25337 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Stopped Server@69659ae4{STOPPING}[10.0.22,sto=15000] 2> 25337 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Shutdown Server@69659ae4{STOPPING}[10.0.22,sto=15000] ... 2> 25373 INFO (StoppableIndexingThread) [n: c: s: r: x: t:] o.a.s.c.s.i.CloudSolrClient request was not communication error it seems 2> 25373 INFO (StoppableIndexingThread) [n: c: s: r: x: t:] o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to (503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from server at http://127.0.0.1:33083/solr: Expected mime type in [application/octet-stream, application/vnd.apache.solr.javabin] but got text/html. <html> 2> <head> 2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/> 2> <title>Error 503 Service Unavailable</title> 2> </head> 2> <body><h2>HTTP ERROR 503 Service Unavailable</h2> 2> <table> 2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr> 2> <tr><th>STATUS:</th><td>503</td></tr> 2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr> 2> <tr><th>SERVLET:</th><td>-</td></tr> 2> </table> 2> <hr/><a href="https://jetty.org">Powered by Jetty:// 10.0.22</a><hr/> 2> 2> </body> 2> </html> 2> , retry=0 maxRetries=5 commError=false errorCode=503 2> 25374 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.AbstractConnector Stopped ServerConnector@37112d56{HTTP/1.1, (http/1.1, h2c)}{127.0.0.1:0} {noformat} Here's what seems to be happening AFAICT: * FYI: This test seems to put a single replica on each jetty for simplicity * The test randomization can randomly decide to use a CloudSolrClient that has {{isUpdatesToLeaders()==false}} ** When that happens, updates may be sent to a pull replica, which is expected to forward the request to it's leader * _Before_ we had graceful jetty shutdown: ** If ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP connections from a StoppableIndexingThread would fail immediately with a connection level error ** At which point the CloudSolrClient used by the StoppableIndexingThread considers it a "communications error" and will retry a different replica/jetty * _After_ the addition of graceful jetty shutdown: ** if ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP connections from a StoppableIndexingThread stay alive until the server is "finished" with the HTTP request ** But logic in solr may decide to "finish" the request prematurely by throwing a 503 SolrException if it notices CoreContainer shutdown is in progress. Example: {{DistributedZkUpdateProcessor.zkCheck()}} ** At which point the CloudSolrClient used by the StoppableIndexingThread gets a {{RemoteSolrException}} , and does _not_ consider that to be a communcation error, and does _not_ retry the update with a different replica/jetty ---- While it would be easy to "fix" this test by forcing {{{}isUpdatesToLeaders()==true{}}}, I'm not sure what the best "fix" is for the underlying behavior in Solr/SolrJ is? * Should code in Solr that checks for shutdown and throws a 503 _stop_ doing that so that in flight requests can finish? * Should CloudSolrClient retry on any 503 ? * Should we revert the use of jetty's graceful shutdown (or change it's default to 'false') ? ** This seems kind of insane, but at the moment: the behavior of CloudSolrClient + solr-core shutdown checks means CloudSolrClient is more likely to retry other nodes w/o jetty's graceful shutdown logic in place. > "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest > failures > ----------------------------------------------------------------------------------- > > Key: SOLR-17764 > URL: https://issues.apache.org/jira/browse/SOLR-17764 > Project: Solr > Issue Type: Bug > Reporter: Chris M. Hostetter > Priority: Major > Attachments: > E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt > > > Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test > ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time > right around the time that SOLR-17744 was committed. > Things i have observed: > * Seeds from failing runs seem to reliably reproduce the failure > ** These failures do *NOT* reproduce if i revert to just before SOLR-17744 > * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem > to reliably succeed on all subsequent attempts > ** Suggesting that the root cause is something deterministic in the > {{{}random(){}}}-ness of the test, and not something dependent on timing or > concurrency. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org