[ https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13599:
----------------------------
    Attachment: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
        Status: Open  (was: Open)


Details of Uwe's jenkins updates...

* http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E
* http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E
* http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/raw/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E/4

----

I'm attaching thetaphi_Lucene-Solr-master-Windows_8025.log.txt as an 
illustrative example of the failure; here are some key snippets, along with 
the associated line numbers from the test class...


{noformat}

# Previously: tested individual adds, delById, and delByQ using...
#  ... rf=3 with all replicas connected,
#  ... rf=2 when one replica's proxy is closed,
#  ... rf=1 when both replicas' proxies are closed

# Lines # 314-320 - "heal" the cluster (re-enable all proxies)

...
   [junit4]   2> 555732 INFO  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [     ] 
o.a.s.c.AbstractFullDistribZkTestBase Found 3 replicas and leader on 
127.0.0.1:59004_ for shard1 in repfacttest_c8n_1x3
   [junit4]   2> 555732 INFO  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [     ] 
o.a.s.c.AbstractFullDistribZkTestBase Took 7107.0 ms to see all replicas become 
active.
...


# Lines # 322-326 - checks that (individual) add, delById & delByQ all get rf=3

# Lines # 328-341 - checks that (batched) add, delById & delByQ all get rf=3

# Line #  344 - close a proxy port (59108) again ...

   [junit4]   2> 556060 WARN  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [     ] 
o.a.s.c.s.c.SocketProxy Closing 1 connections to: http://127.0.0.1:59108/, 
target: http://127.0.0.1:59109/
{noformat}
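
Aside, for anyone skimming: the "partition" here is simulated by the test 
framework's SocketProxy sitting between the nodes. A minimal sketch of the 
mechanics; the ports are taken from this log, but the constructor/method 
signatures are from memory, so treat them as assumptions rather than gospel...

{code:java}
import java.net.URI;
import org.apache.solr.client.solrj.cloud.SocketProxy;

// Rough sketch of how the test simulates a partition: a SocketProxy listens
// on a "public" port (59108 here) and forwards traffic to the node's real
// jetty port (59109).  Closing the proxy drops all connections, so the node
// keeps running but becomes unreachable; reopen() "heals" the partition.
// (Ports are from this log; signatures are assumed, not verified.)
public class PartitionSketch {
  public static void main(String[] args) throws Exception {
    SocketProxy proxy = new SocketProxy(59108, false /* no SSL */);
    proxy.open(URI.create("http://127.0.0.1:59109/"));
    // ... updates flow through the proxy; rf=3 is achievable ...
    proxy.close();   // what "Line # 344" does - updates now get Connection refused
    // ... test sends updates expecting a reduced rf ...
    proxy.reopen();  // re-enable the proxy so the cluster can recover
  }
}
{code}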

At this point, the next thing in the test is to add a batch of documents 
(ids #15-29) while one replica is partitioned -- but I should point out that 
it's not immediately obvious to me whether the {{updateExecutor-1924-thread-4}} 
logging from the leader below (complaining about {{Connection refused}} to 
port 59108) is *because* of the update sent by the client, or independently 
because the HTTP2 connection management detected that the proxy was closed...
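
For context before diving into that log: the batched update and its rf check 
boil down to roughly the following sketch. Variable and helper names here are 
illustrative, not copied from ReplicationFactorTest, though 
{{getMinAchievedReplicationFactor}} is the real SolrJ method involved...

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// Minimal sketch of the batched add + rf check (names are illustrative).
static int addBatchAndGetRf(CloudSolrClient cloudClient, String collection,
                            int firstId, int lastId) throws Exception {
  UpdateRequest batch = new UpdateRequest();
  for (int id = firstId; id <= lastId; id++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.setField("id", String.valueOf(id));
    batch.add(doc);
  }
  UpdateResponse rsp = batch.process(cloudClient, collection);
  // the minimum replication factor achieved across the whole batch; the
  // failing assertion expects 2 here (leader + the one live replica)
  return cloudClient.getMinAchievedReplicationFactor(collection, rsp.getResponse());
}
{code}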

{noformat}
# Lines # 346-355 - send our first "batch" (ids #15-29) when cluster isn't "healed"

   [junit4]   2> 558074 ERROR 
(updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/
   [junit4]   2>           => java.io.IOException: java.net.ConnectException: 
Connection refused: no further information
...

# ...there are more details about suppressed exceptions
# ...this ERROR repeats many times - evidently as the leader tries to reconnect...

...
   [junit4]   2> 560193 ERROR 
(updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/
   [junit4]   2>           => java.io.IOException: java.net.ConnectException: 
Connection refused: no further information
...

# ... brief bit of path=/admin/metrics logging from both n:127.0.0.1:59004_ and n:127.0.0.1:59084_
# ... and some other MetricsHistoryHandler logging (from overseer?) about failing to talk to 127.0.0.1:59108
# ... but mostly lots of logging from the leader about not being able to connect to 127.0.0.1:59108



# live replica (port 59060) logs that it's added the 15 docs FROMLEADER, ... BUT!!!!...
# ... same thread then logs jetty EofException: Reset cancel_stream_error
# ... so apparently it added the docs but had a problem communicating that back to the leader
# ... evidently because it took 30 seconds (QTime = 30013) and the leader gave up (see below)

   [junit4]   2> 591364 INFO  (qtp1520091886-5884) [n:127.0.0.1:59060_ 
c:repfacttest_c8n_1x3 s:shard1 r:core_node4 
x:repfacttest_c8n_1x3_shard1_replica_n1 ] o.a.s.u.p.LogUpdateProcessorFactory 
[repfacttest_c8n_1x3_shard1_replica_n1]  webapp= path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:59004/repfacttest_c8n_1x3_shard1_replica_n2/&wt=javabin&version=2}{add=[15
 (1637713552307388416), 16 (1637713552307388417), 17 (1637713552307388418), 18 
(1637713552308436992), 19 (1637713552308436993), 20 (1637713552308436994), 21 
(1637713552308436995), 22 (1637713552308436996), 23 (1637713552308436997), 24 
(1637713552308436998), ... (15 adds)]} 0 30013
   [junit4]   2> 591367 ERROR (qtp1520091886-5884) [n:127.0.0.1:59060_ 
c:repfacttest_c8n_1x3 s:shard1 r:core_node4 
x:repfacttest_c8n_1x3_shard1_replica_n1 ] o.a.s.h.RequestHandlerBase 
org.eclipse.jetty.io.EofException: Reset cancel_stream_error
   [junit4]   2>        at 
org.eclipse.jetty.http2.server.HTTP2ServerConnectionFactory$HTTPServerSessionListener.onReset(HTTP2ServerConnectionFactory.java:157)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.notifyReset(HTTP2Stream.java:574)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.onReset(HTTP2Stream.java:343)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.process(HTTP2Stream.java:252)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Session.onReset(HTTP2Session.java:294)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser$Listener$Wrapper.onReset(Parser.java:368)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.BodyParser.notifyReset(BodyParser.java:139)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ResetBodyParser.onReset(ResetBodyParser.java:97)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ResetBodyParser.parse(ResetBodyParser.java:66)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:194)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:123)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ServerParser.parse(ServerParser.java:115)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:248)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:170)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:125)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:348)
   [junit4]   2>        at 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
   [junit4]   2>        at 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:68)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.invokeTask(EatWhatYouKill.java:345)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:300)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
   [junit4]   2>        at java.base/java.lang.Thread.run(Thread.java:830)
   [junit4]   2>        Suppressed: java.lang.Throwable: HttpInput failure
   [junit4]   2>                at 
org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:831)
   [junit4]   2>                at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.onFailure(HttpChannelOverHTTP2.java:323)
   [junit4]   2>                at 
org.eclipse.jetty.http2.server.HTTP2ServerConnection.onStreamFailure(HTTP2ServerConnection.java:219)
   [junit4]   2>                ... 30 more
   [junit4]   2> 

# FYI: same request thread on port 59060 also logs the same exception from o.a.s.s.HttpSolrCall ...

   [junit4]   2> 591367 ERROR (qtp1520091886-5884) [n:127.0.0.1:59060_ 
c:repfacttest_c8n_1x3 s:shard1 r:core_node4 
x:repfacttest_c8n_1x3_shard1_replica_n1 ] o.a.s.s.HttpSolrCall 
null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error
   [junit4]   2>        at 
org.eclipse.jetty.http2.server.HTTP2ServerConnectionFactory$HTTPServerSessionListener.onReset(HTTP2ServerConnectionFactory.java:157)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.notifyReset(HTTP2Stream.java:574)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.onReset(HTTP2Stream.java:343)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Stream.process(HTTP2Stream.java:252)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Session.onReset(HTTP2Session.java:294)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser$Listener$Wrapper.onReset(Parser.java:368)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.BodyParser.notifyReset(BodyParser.java:139)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ResetBodyParser.onReset(ResetBodyParser.java:97)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ResetBodyParser.parse(ResetBodyParser.java:66)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:194)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:123)
   [junit4]   2>        at 
org.eclipse.jetty.http2.parser.ServerParser.parse(ServerParser.java:115)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:248)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:170)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:125)
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:348)
   [junit4]   2>        at 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
   [junit4]   2>        at 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:68)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.invokeTask(EatWhatYouKill.java:345)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:300)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
   [junit4]   2>        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
   [junit4]   2>        at java.base/java.lang.Thread.run(Thread.java:830)
   [junit4]   2>        Suppressed: java.lang.Throwable: HttpInput failure
   [junit4]   2>                at 
org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:831)
   [junit4]   2>                at 
org.eclipse.jetty.http2.server.HttpChannelOverHTTP2.onFailure(HttpChannelOverHTTP2.java:323)
   [junit4]   2>                at 
org.eclipse.jetty.http2.server.HTTP2ServerConnection.onStreamFailure(HTTP2ServerConnection.java:219)
   [junit4]   2>                ... 30 more

# leader's updateExecutor-1924-thread-4 complains many more times that it isn't able to update
# (the still down) 127.0.0.1:59108 ...


   [junit4]   2> 591661 ERROR 
(updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/
   [junit4]   2>           => java.io.IOException: java.net.ConnectException: 
Connection refused: no further information
   [junit4]   2>        at 
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
   ...

# Eventually (a different) updateExecutor-1924-thread-2 on the leader also complains that it
# couldn't send the update to port 59060 because of a "TimeoutException: idle_timeout"
# (which is either a cause or an effect of 59060's "EofException: Reset cancel_stream_error" above)


   [junit4]   2> 591661 ERROR 
(updateExecutor-1924-thread-2-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59060/repfacttest_c8n_1x3_shard1_replica_n1/ to 
http://127.0.0.1:59060/repfacttest_c8n_1x3_shard1_replica_n1/
   [junit4]   2>           => java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout
   [junit4]   2>        at 
org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:221)
   [junit4]   2> java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout
   [junit4]   2>        at 
org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:221)
 ~[jetty-client-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:240)
 ~[java/:?]
   [junit4]   2>        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
 ~[java/:?]
   [junit4]   2>        at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
 ~[metrics-core-4.0.5.jar:4.0.5]
   [junit4]   2>        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 ~[java/:?]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:830) [?:?]
   [junit4]   2> Caused by: java.util.concurrent.TimeoutException: idle_timeout
   [junit4]   2>        at 
org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.onIdleTimeout(HttpConnectionOverHTTP2.java:137)
 ~[http2-http-client-transport-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onIdleTimeout(HttpClientTransportOverHTTP2.java:243)
 ~[http2-http-client-transport-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Session.notifyIdleTimeout(HTTP2Session.java:1165) 
~[http2-common-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Session.onIdleTimeout(HTTP2Session.java:1003) 
~[http2-common-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.http2.HTTP2Connection.onIdleExpired(HTTP2Connection.java:150) 
~[http2-common-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:401) 
~[jetty-io-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) 
~[jetty-io-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) 
~[jetty-io-9.4.19.v20190610.jar:9.4.19.v20190610]
   [junit4]   2>        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
   [junit4]   2>        at 
java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
   [junit4]   2>        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 ~[?:?]
   [junit4]   2>        ... 3 more


# lots more logging from the leader about being unable to talk to the (down) 127.0.0.1:59108
{noformat}
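
The QTime of 30013 on the replica lines up suspiciously well with a 30 second 
idle timeout somewhere in the HTTP/2 client stack. Here's a standalone sketch 
of that failure mode -- this is NOT Solr code, and the port, endpoint, and 
timeout value are purely hypothetical...

{code:java}
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

// Illustrates the "idle_timeout" mechanics seen above: if the server takes
// longer than the client's idle timeout to produce a response, the client
// resets the HTTP/2 stream.  The server side then sees "EofException: Reset
// cancel_stream_error"; the client side sees a TimeoutException:
// idle_timeout (wrapped in an ExecutionException when the response is
// consumed via InputStreamResponseListener, as SolrJ does).
public class IdleTimeoutDemo {
  public static void main(String[] args) throws Exception {
    HttpClient client =
        new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()), null);
    client.setIdleTimeout(30_000); // 30s - note the QTime=30013 above
    client.start();
    try {
      // hypothetical endpoint that takes more than 30s to respond
      ContentResponse rsp = client.GET("http://127.0.0.1:8983/slow");
      System.out.println(rsp.getStatus());
    } catch (Exception expected) {
      // expect a TimeoutException: idle_timeout (possibly wrapped)
      expected.printStackTrace();
    } finally {
      client.stop();
    }
  }
}
{code}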


> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13599
>                 URL: https://issues.apache.org/jira/browse/SOLR-13599
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
>
> We've started seeing some weirdly consistent (but not reliably reproducible) 
> failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
> machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded the OS 
> and Java versions on his Windows VMs -- but they happen across all versions 
> of java tested, and on both master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various 
> jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on 
> all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when 
> it fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins 
> builds frequently fails anywhere from 1-4 additional times.
> All of these failures occur in the exact same place, with the exact same 
> assertion: that the expected replicationFactor of 2 was not achieved, and 
> rf=1 (ie: only the leader) was returned, when sending a _batch_ of documents 
> to a collection with 1 shard, 3 replicas, while 1 of the replicas was 
> partitioned off due to a closed proxy.
> In the handful of logs I've examined closely, the 2nd "live" replica does in 
> fact log that it received & processed the update, but with a QTime of over 
> 30 seconds, and then it immediately logs an 
> {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
> meanwhile, the leader has one {{updateExecutor}} thread logging copious 
> amounts of {{java.net.ConnectException: Connection refused: no further 
> information}} regarding the replica that was partitioned off, before a second 
> {{updateExecutor}} thread ultimately logs 
> {{java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
> replica.
> ----
> What makes this perplexing is that this is not the first time in the test 
> that documents were added to this collection while one replica was 
> partitioned off, but it is the first time that all 3 of the following are 
> true _at the same time_:
> # the collection has recovered after some replicas were partitioned and 
> re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document 
> adds were tested while replicas were partitioned.  Batches of adds were only 
> tested when all 3 replicas were "live" after the proxies were re-opened and 
> the collection had fully recovered.  The failure also comes from the first 
> update to happen after a replica's proxy port has been "closed" for the 
> _second_ time.
> While this confluence of events might conceivably trigger some weird bug, 
> what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June 22.


