[ 
https://issues.apache.org/jira/browse/CASSANDRA-19505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-19505:
----------------------------------------
    Test and Documentation Plan: Test only fix
                         Status: Patch Available  (was: In Progress)

I found a couple of failure modes here:

* Roughly once in every 30 runs, one peer is unlucky enough that every 
attempt it makes to contact the one live peer is replaced with a "random" 
injected failure. In this scenario, that node is effectively partitioned, so 
it can neither discover any peers nor be discovered. I've added checks to the 
test to account for this. 

* It's possible for some nodes to have completed their discovery rounds before 
others have been able to make a single successful contact. In this case, a node 
finishing early won't be able to discover one which starts late. I've 
increased the concurrency to mitigate this.
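
For illustration only, the adjusted expectation for the first failure mode 
could be sketched roughly as below. This is not the code from the patch: the 
class DiscoveryExpectation, the method expectedDiscovered and the map of 
successful contact counts are hypothetical names, and the sketch assumes the 
test can tell how many of a peer's contact attempts got through. The addresses 
are sample values in the style of the reported assertion.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not taken from the actual patch.
final class DiscoveryExpectation
{
    /**
     * Returns the peers that the other nodes can be expected to discover.
     * A peer with zero successful contacts is assumed to be effectively
     * partitioned and is left out of the expected set.
     *
     * @param successfulContacts peer address mapped to the number of its
     *                           contact attempts that were not replaced
     *                           by an injected failure
     */
    public static Set<String> expectedDiscovered(Map<String, Integer> successfulContacts)
    {
        Set<String> expected = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : successfulContacts.entrySet())
        {
            if (entry.getValue() > 0)
                expected.add(entry.getKey());
        }
        return expected;
    }

    public static void main(String[] args)
    {
        Map<String, Integer> contacts = new LinkedHashMap<>();
        contacts.put("/127.0.100.3:7012", 4);
        contacts.put("/127.0.100.4:7012", 0); // every attempt hit an injected failure
        contacts.put("/127.0.100.5:7012", 2);

        // The partitioned peer is excluded from what the test should expect.
        System.out.println(expectedDiscovered(contacts));
    }
}
```

With an expectation built this way, the assertion would compare each healthy 
node's discovered set against the filtered set rather than the full peer list.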

This has now passed for 1000 consecutive runs locally, so I think we should be 
good to merge it. If any additional failure cases show up, we can revisit.

> Test failure: org.apache.cassandra.tcm.DiscoverySimulationTest
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-19505
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19505
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Transactional Cluster Metadata
>            Reporter: Brandon Williams
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 5.x
>
>
> As seen at:
> https://app.circleci.com/pipelines/github/driftx/cassandra/1551/workflows/be2ca359-49d7-4fb3-959d-454656f48d79/jobs/80604/tests
> {noformat}
> junit.framework.AssertionFailedError: expected:<[/127.0.100.1:7012, 
> /127.0.100.2:7012, /127.0.100.3:7012, /127.0.100.4:7012, /127.0.100.5:7012, 
> /127.0.100.6:7012, /127.0.100.7:7012, /127.0.100.8:7012, /127.0.100.9:7012, 
> /127.0.100.10:7012]> but was:<[/127.0.100.1:7012, /127.0.100.2:7012, 
> /127.0.100.3:7012, /127.0.100.5:7012, /127.0.100.6:7012, /127.0.100.7:7012, 
> /127.0.100.8:7012, /127.0.100.9:7012, /127.0.100.10:7012]>
>       at org.apache.cassandra.tcm.DiscoverySimulationTest.discoveryTest(DiscoverySimulationTest.java:105)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {noformat}




