[jira] [Updated] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

Hoss Man (JIRA) Tue, 02 Jul 2019 11:36:30 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-13599:
----------------------------
    Description: 
We've started seeing some weirdly consistent (but not reliably reproducible) 
failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
machines.

The failures all seem to have started on June 22, when Uwe updated his 
Windows VMs to upgrade the Java version, but they happen across all versions of 
Java tested, and on both master and branch_8x.

While this test failed a total of 5 times, in different ways, on various 
jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all 
but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it 
fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds 
frequently fails anywhere from 1 to 4 additional times.

All of these failures occur in the exact same place, with the exact same 
assertion: that the expected replicationFactor of 2 was not achieved, and an 
rf=1 (ie: only the leader) was returned, when sending a _batch_ of documents to 
a collection with 1 shard and 3 replicas, while 1 of the replicas was 
partitioned off due to a closed proxy.
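
To make the assertion concrete, here is a toy model of the arithmetic behind it. It is only a sketch, not Solr's implementation: the function name {{achieved_rf}} and the replica labels are hypothetical, and it assumes the achieved rf is simply the leader plus every other replica that acknowledged the update in time.

```python
def achieved_rf(follower_acks):
    """Toy model: the leader always counts as 1, plus one for each
    non-leader replica that acknowledged the update in time.

    follower_acks maps a (hypothetical) replica label to True/False.
    """
    return 1 + sum(1 for acked in follower_acks.values() if acked)


# 1 shard, 3 replicas: the leader plus two others, one of them partitioned off.
expected = achieved_rf({"partitioned": False, "live": True})
# What the failing runs report: the "live" replica's ack is not counted either.
observed = achieved_rf({"partitioned": False, "live": False})
```

Under this model {{expected}} is 2 and {{observed}} is 1, which matches the values in the failing assertion.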

In the handful of logs I've examined closely, the 2nd "live" replica does in 
fact log that it received & processed the update, but with a QTime of over 30 
seconds, and it then immediately logs an 
{{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception. 
Meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts 
of {{java.net.ConnectException: Connection refused: no further information}} 
regarding the replica that was partitioned off, before a second 
{{updateExecutor}} thread ultimately logs 
{{java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
replica.
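
For anyone who wants to check other failing runs for the same pattern, a rough sketch of a log scan follows. The exception strings are copied from the messages quoted above. Everything else is an assumption: the labels and the function names {{classify}} and {{summarize}} are made up for illustration, and it assumes each exception message appears on a single log line.

```python
# Substrings copied from the log messages quoted in the description.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = {
    "replica_stream_reset":
        "org.eclipse.jetty.io.EofException: Reset cancel_stream_error",
    "partitioned_replica_refused":
        "java.net.ConnectException: Connection refused",
    "leader_idle_timeout":
        "java.util.concurrent.TimeoutException: idle_timeout",
}


def classify(line):
    """Return the label of the first known signature found in a line, or None."""
    for label, needle in SIGNATURES.items():
        if needle in line:
            return label
    return None


def summarize(lines):
    """Count how many lines match each signature."""
    counts = {label: 0 for label in SIGNATURES}
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Passing the lines of a test log to {{summarize}} gives a quick way to see whether a run shows many "Connection refused" lines followed by an {{idle_timeout}}, as in the logs described above.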


----

What makes this perplexing is that this is not the first time in the test that 
documents were added to this collection while one replica was partitioned off, 
but it is the first time that all 3 of the following are true _at the same 
time_:

# the collection has recovered after some replicas were partitioned and 
re-connected
# a batch of multiple documents is being added
# one replica has been "re" partitioned.

...prior to the point when this failure happens, only individual document adds 
were tested while replicas were partitioned.  Batches of adds were only tested 
when all 3 replicas were "live" after the proxies were re-opened and the 
collection had fully recovered.  The failure also comes from the first update 
to happen after a replica's proxy port has been "closed" for the _second_ time.
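
The combination of conditions can be sketched as a predicate over a simplified timeline. This is a reconstruction from the description above, not the test's actual step list: the {{Update}} class, its field names, the step labels, and the document counts are all placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Update:
    label: str
    docs: int               # number of documents in the update (placeholder values)
    partitioned: int        # replicas cut off when the update is sent
    recovered_before: bool  # collection already went through partition + recovery


def all_three(u):
    """True only when the three listed conditions hold at the same time."""
    return u.recovered_before and u.docs > 1 and u.partitioned > 0


# Simplified ordering of the updates described above.
timeline = [
    Update("single add, replica partitioned", 1, 1, False),
    Update("batch add, all replicas live after recovery", 10, 0, True),
    Update("batch add, replica re-partitioned", 10, 1, True),
]

first_match = next(u.label for u in timeline if all_three(u))
```

In this sketch only the last update satisfies all three conditions, which is the point in the test where the failures occur.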

While this confluence of events might conceivably trigger some weird bug, 
what makes these failures _particularly_ perplexing is that:
* the failures only happen on Windows
* the failures only started after the Windows VM update on June 22.




> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13599
>                 URL: https://issues.apache.org/jira/browse/SOLR-13599
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major


