[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-17764:
--------------------------------------
    Attachment: 
E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
        Status: Open  (was: Open)

(Note all comments below relate to testing done against GIT SHA 
d0d71a7b38f87f3c70551678b2eab62d925483ef)

I'm attaching an example 
OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt 
using seed=E7F93005B9386058

While the test "officially" fails at 
{{ChaosMonkeySafeLeaderWithPullReplicasTest.java:210}} this is just an 
assertion that none of the {{StoppableIndexingThread}} encountered any 
failures. If we search the test for the first instance of {{REQUEST FAILED}} 
logged by any {{StoppableIndexingThread}} we see the following: Note the port 
number of request URI in the error...
{noformat}
  2> 25374 ERROR (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.StoppableIndexingThread REQUEST FAILED for id=0-14
  2>           => org.apache.solr.client.solrj.SolrClient$RemoteSolrException: 
Error from server at http://127.0.0.1:33083/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. <html>
  2> <head>
  2> org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:33083/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. <html>
  2> <head>
  2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
  2> <title>Error 503 Service Unavailable</title>
  2> </head>
  2> <body><h2>HTTP ERROR 503 Service Unavailable</h2>
  2> <table>
  2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr>
  2> <tr><th>STATUS:</th><td>503</td></tr>
  2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr>
  2> <tr><th>SERVLET:</th><td>-</td></tr>
  2> </table>
  2> <hr/><a href="https://jetty.org";>Powered by Jetty:// 10.0.22</a><hr/>
  2> 
  2> </body>
  2> </html>
  2> 
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:651)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:250)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:515) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:497) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:524) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:466) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1151)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:844)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:775)
 ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:285) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:305) 
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
  2>    at 
org.apache.solr.cloud.StoppableIndexingThread.indexDocs(StoppableIndexingThread.java:193)
 ~[solr-test-framework-10.0.0-SNAPSHOT.jar:?]
  2>    at 
org.apache.solr.cloud.StoppableIndexingThread.run(StoppableIndexingThread.java:124)
 [solr-test-framework-10.0.0-SNAPSHOT.jar:?]
{noformat}
If we go backwards a little bit in the log file, looking specifically for port 
number {{33083}} we see...
{noformat}
  2> 25336 INFO  (ChaosMonkey) [n: c: s: r: x: t:] o.a.s.c.ChaosMonkey monkey: 
chose a victim! 33083
  2> 25337 INFO  (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Stopped 
Server@69659ae4{STOPPING}[10.0.22,sto=15000]
  2> 25337 INFO  (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Shutdown 
Server@69659ae4{STOPPING}[10.0.22,sto=15000]
...  
  2> 25373 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient request was not communication error it seems
  2> 25373 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:33083/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. <html>
  2> <head>
  2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
  2> <title>Error 503 Service Unavailable</title>
  2> </head>
  2> <body><h2>HTTP ERROR 503 Service Unavailable</h2>
  2> <table>
  2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr>
  2> <tr><th>STATUS:</th><td>503</td></tr>
  2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr>
  2> <tr><th>SERVLET:</th><td>-</td></tr>
  2> </table>
  2> <hr/><a href="https://jetty.org";>Powered by Jetty:// 10.0.22</a><hr/>
  2> 
  2> </body>
  2> </html>
  2> , retry=0 maxRetries=5 commError=false errorCode=503 
  2> 25374 INFO  (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.AbstractConnector 
Stopped ServerConnector@37112d56{HTTP/1.1, (http/1.1, h2c)}{127.0.0.1:0}
{noformat}
Here's what seems to be happening AFAICT:
 * FYI: This test seems to put a single replica on each jetty for simplicity
 * The test randomization can randomly decide to use a CloudSolrClient that has 
{{isUpdatesToLeaders()==false}}
 ** When that happens, updates may be sent to a pull replica, which is expected 
to forward the request to it's leader
 * _Before_ we had graceful jetty shutdown:
 ** If ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP 
connections from a StoppableIndexingThread would fail immediately with a 
connection level error
 ** At which point the CloudSolrClient used by the StoppableIndexingThread 
considers it a "communications error" and will retry a different replica/jetty
 * _After_ the addition of graceful jetty shutdown:
 ** if ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP 
connections from a StoppableIndexingThread stay alive until the server is 
"finished" with the HTTP request
 ** But logic in solr may decide to "finish" the request prematurely by 
throwing a 503 SolrException if it notices CoreContainer shutdown is in 
progress. Example: {{DistributedZkUpdateProcessor.zkCheck()}}
 ** At which point the CloudSolrClient used by the StoppableIndexingThread gets 
a  {{RemoteSolrException}} , and does _not_ consider that to be a communcation 
error, and does _not_ retry the update with a different replica/jetty

 
----
 

While it would be easy to "fix" this test by forcing 
{{{}isUpdatesToLeaders()==true{}}}, I'm not sure what the best "fix" is for the 
underlying behavior in Solr/SolrJ is?
 * Should code in Solr that checks for shutdown and throws a 503 _stop_ doing 
that so that in flight requests can finish?
 * Should CloudSolrClient retry on any 503 ?
 * Should we revert the use of jetty's graceful shutdown (or change it's 
default to 'false') ?
 ** This seems kind of insane, but at the moment: the behavior of 
CloudSolrClient + solr-core shutdown checks means CloudSolrClient is more 
likely to retry other nodes w/o  jetty's graceful shutdown logic in place.

> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-17764
>                 URL: https://issues.apache.org/jira/browse/SOLR-17764
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things i have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if i revert to just before SOLR-17744
>  * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** Suggesting that the root cause is something deterministic in the 
> {{{}random(){}}}-ness of the test, and not something dependent on timing or 
> concurrency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to