[ https://issues.apache.org/jira/browse/CASSANDRA-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928049#comment-17928049 ]
Dmitry Konstantinov edited comment on CASSANDRA-20251 at 2/18/25 2:02 PM:
--------------------------------------------------------------------------
Updated MR: [https://github.com/apache/cassandra/pull/3901]
The repeated test run is passing now:
[https://app.circleci.com/pipelines/github/instaclustr/cassandra/5461/workflows/a9e3bf40-6563-4a59-8f00-3c90936492df/jobs/344706]
The root cause is a combination of two factors:
* The Java dtest starts the 2nd and 3rd nodes in parallel, so they can be added
to TCM in either order. As a result, when a read plan is created, either the
2nd or the 3rd node may be picked as the second replica. The test expects the
2nd node to be the initial remote read replica and emulates a network
connectivity failure between the 1st node and it. To fix the non-deterministic
behaviour, the NetworkTopologyProximity implementation is adjusted using
ByteBuddy so that it returns the nodes in the expected order (the same idea is
used in the Python read repair dtests).
* Dynamic snitch was enabled, and it may reorder the nodes returned by
NetworkTopologyProximity, which also breaks the test's assumption. To fix this,
dynamic snitch is disabled.
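To illustrate the first point, here is a minimal, self-contained sketch of the ordering idea only. It is not the code from the PR and does not use ByteBuddy or any Cassandra class: the class name, the method sortByProximity, and the use of plain integer node ids are all assumptions made for the example. It shows that once replicas are sorted by a fixed rule (coordinator first, then by node id), the result no longer depends on the order in which the nodes joined.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DeterministicReplicaOrder {
    // Hypothetical helper, not Cassandra code: puts the coordinator first and
    // orders the remaining replicas by node id, ignoring the order they joined in.
    static List<Integer> sortByProximity(int coordinatorId, List<Integer> replicaIds) {
        List<Integer> sorted = new ArrayList<>(replicaIds);
        sorted.sort(Comparator
            // Boolean key: false (the coordinator itself) sorts before true.
            .comparing((Integer id) -> id != coordinatorId)
            // Tie-break the remote replicas by ascending node id.
            .thenComparingInt(Integer::intValue));
        return sorted;
    }

    public static void main(String[] args) {
        // Nodes 2 and 3 joined in different orders in the two runs.
        System.out.println(sortByProximity(1, List.of(1, 3, 2))); // prints [1, 2, 3]
        System.out.println(sortByProximity(1, List.of(1, 2, 3))); // prints [1, 2, 3]
    }
}
```

In both runs node 2 ends up as the first remote replica, which is the property the test relies on when it blocks traffic between node 1 and node 2.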
Note: to simplify troubleshooting of Java dtests, I have added logging of the
actual configuration in Java dtests (the default server logic does not print it
because in tests logging is initialized later and the config is overridden).
> Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-20251
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20251
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments:
> TEST-org.apache.cassandra.distributed.test.ReadSpeculationTest.xml,
> system_node1.log, system_node2.log, system_node3.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/5285/workflows/20d3f23b-d9e5-4130-8c28-d87682f919de/jobs/329400/tests]
> {code:java}
> junit.framework.AssertionFailedError:
> Expecting actual:
> 6434477L
> to be greater than:
> 2000000000L
> at org.apache.cassandra.distributed.test.ReadSpeculationTest$TestScenario.assertWillSpeculate(ReadSpeculationTest.java:172)
> at org.apache.cassandra.distributed.test.ReadSpeculationTest.lambda$speculateTest$81c80a4a$2(ReadSpeculationTest.java:74)
> at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
> at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
> at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> Present in Butler as well:
> https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/org.apache.cassandra.distributed.test/ReadSpeculationTest/speculateTest