[jira] [Commented] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

ASF subversion and git services (JIRA) Mon, 08 Jul 2019 09:00:06 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880493#comment-16880493
 ]


ASF subversion and git services commented on SOLR-13599:
--------------------------------------------------------

Commit 4fd1850d2ee2976efe4e1ee5645d32dc394714b1 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4fd1850 ]

SOLR-13599: additional 'checkpoint' logging to try and help diagnose strange 
failures

(cherry picked from commit b4a602f6b24196273adbdb7d47bf42fa8d08d807)


> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13599
>                 URL: https://issues.apache.org/jira/browse/SOLR-13599
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
>
> We've started seeing some weirdly consistent (but not reliably reproducible) 
> failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
> machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded the OS 
> and Java version on his Windows VMs -- but they happen across all versions of 
> Java tested, and on both master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various 
> jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on 
> all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and 
> when it fails the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins 
> builds frequently fails anywhere from 1-4 additional times.
> All of these failures occur in the exact same place, with the exact same 
> assertion: that the expected replicationFactor of 2 was not achieved, and an 
> rf=1 (i.e. only the leader) was returned, when sending a _batch_ of documents 
> to a collection with 1 shard, 3 replicas; while 1 of the replicas was 
> partitioned off due to a closed proxy.
> In the handful of logs I've examined closely, the 2nd "live" replica does in 
> fact log that it received & processed the update, but with a QTime of over 30 
> seconds, and then it immediately logs an 
> {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
> meanwhile, the leader has one {{updateExecutor}} thread logging copious 
> amounts of {{java.net.ConnectException: Connection refused: no further 
> information}} regarding the replica that was partitioned off, before a second 
> {{updateExecutor}} thread ultimately logs 
> {{java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
> replica.
> ----
> What makes this perplexing is that this is not the first time in the test 
> that documents were added to this collection while one replica was 
> partitioned off, but it is the first time that all 3 of the following are 
> true _at the same time_:
> # the collection has recovered after some replicas were partitioned and 
> re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document 
> adds were tested while replicas were partitioned.  Batches of adds were only 
> tested when all 3 replicas were "live" after the proxies were re-opened and 
> the collection had fully recovered.  The failure also comes from the first 
> update to happen after a replica's proxy port has been "closed" for the 
> _second_ time.
> While this confluence of events might conceivably trigger some weird bug, 
> what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June-22.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

