[ 
https://issues.apache.org/jira/browse/CASSANDRA-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928049#comment-17928049
 ] 

Dmitry Konstantinov edited comment on CASSANDRA-20251 at 2/18/25 2:02 PM:
--------------------------------------------------------------------------

Updated MR: [https://github.com/apache/cassandra/pull/3901]

The repeated test is passing now: 
[https://app.circleci.com/pipelines/github/instaclustr/cassandra/5461/workflows/a9e3bf40-6563-4a59-8f00-3c90936492df/jobs/344706]

The root cause is a combination of two factors:
 * The Java dtest starts the 2nd and 3rd nodes in parallel, so they can be 
added to TCM in either order. As a result, when a read plan is created, either 
the 2nd or the 3rd node may be chosen as the second replica. The test expects 
the 2nd node to be the initial remote read replica and emulates a network 
connectivity failure between the 1st node and it. To fix the non-deterministic 
behaviour, the NetworkTopologyProximity implementation is adjusted using 
ByteBuddy so that it returns the nodes in the expected order (the same idea is 
used in the Python read repair dtests).
 * Dynamic snitch was enabled, and it may reorder the nodes returned by 
NetworkTopologyProximity, which also breaks the test's assumption. To fix this, 
dynamic snitch is disabled.
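
A minimal plain-Java sketch of the ordering idea from the first point, not the 
actual patch: the class and method names are hypothetical, nodes are 
represented by plain strings, and neither ByteBuddy nor any Cassandra API is 
used. It only illustrates how sorting by a fixed expected order makes the 
replica list independent of the order in which the nodes were registered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the Cassandra code base.
public class DeterministicProximitySketch {

    // Returns a copy of `replicas` sorted by their position in `expectedOrder`.
    // Replicas missing from `expectedOrder` go last and keep their relative
    // order, because List.sort is stable.
    public static List<String> sortByExpectedOrder(List<String> replicas,
                                                   List<String> expectedOrder) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((String replica) -> {
            int index = expectedOrder.indexOf(replica);
            return index < 0 ? Integer.MAX_VALUE : index;
        }));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> expectedOrder = List.of("node1", "node2", "node3");
        // In this run node3 happened to be registered before node2.
        List<String> asRegistered = List.of("node1", "node3", "node2");
        System.out.println(sortByExpectedOrder(asRegistered, expectedOrder));
    }
}
```

With such an ordering in place, the second replica in the list is always the 
2nd node, regardless of which node joined first.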

Note: To simplify troubleshooting of Java dtests, I have added logging of the 
actual configuration in Java dtests (the default server logic does not print it 
there because logging is initialized later in tests and the config is 
overridden).


> Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-20251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20251
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 
> TEST-org.apache.cassandra.distributed.test.ReadSpeculationTest.xml, 
> system_node1.log, system_node2.log, system_node3.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/5285/workflows/20d3f23b-d9e5-4130-8c28-d87682f919de/jobs/329400/tests]
> {code:java}
> junit.framework.AssertionFailedError: 
> Expecting actual:
>   6434477L
> to be greater than:
>   2000000000L
>       at org.apache.cassandra.distributed.test.ReadSpeculationTest$TestScenario.assertWillSpeculate(ReadSpeculationTest.java:172)
>       at org.apache.cassandra.distributed.test.ReadSpeculationTest.lambda$speculateTest$81c80a4a$2(ReadSpeculationTest.java:74)
>       at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> Present in Butler as well: 
> https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/org.apache.cassandra.distributed.test/ReadSpeculationTest/speculateTest


