[
https://issues.apache.org/jira/browse/CASSANDRA-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928049#comment-17928049
]
Dmitry Konstantinov edited comment on CASSANDRA-20251 at 2/18/25 11:47 AM:
---------------------------------------------------------------------------
Updated MR: [https://github.com/apache/cassandra/pull/3901]
The repeated test run is passing now:
[https://app.circleci.com/pipelines/github/instaclustr/cassandra/5461/workflows/a9e3bf40-6563-4a59-8f00-3c90936492df/jobs/344706]
The root cause is a combination of two factors:
* The Java dtest starts the 2nd and 3rd nodes in parallel, so they can be added
to TCM in either order. As a result, when a read plan is created, either the
2nd or the 3rd node may be picked as the second replica. The test expects the
2nd node to be the initial remote read replica and emulates a network
connectivity failure between the 1st node and it. To remove this
non-determinism, the NetworkTopologyProximity implementation is adjusted using
ByteBuddy so that it returns the nodes in the expected order (the same idea is
used in the Python read repair dtests).
* Dynamic snitch was enabled, and it may reorder the nodes returned by
NetworkTopologyProximity, which also breaks the test's assumption. To fix this,
dynamic snitch is disabled.
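To illustrate the first factor, here is a minimal, self-contained model of the ordering problem. It is only a sketch: it does not use Cassandra, TCM, or ByteBuddy, and the class and method names (DeterministicReplicaOrder, pinOrder, firstRemoteReplica) are hypothetical. Node ids stand in for replicas, and sorting stands in for the adjusted proximity ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model only: integers stand in for nodes, and none of these
// names exist in Cassandra. It shows why pinning the order makes the
// "first remote replica" independent of node registration order.
public class DeterministicReplicaOrder {

    // Hypothetical stand-in for the adjusted proximity ordering:
    // returns a copy of the replicas sorted by node id.
    static List<Integer> pinOrder(List<Integer> replicas) {
        List<Integer> copy = new ArrayList<>(replicas);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Returns the first replica in plan order that is not the coordinator.
    static int firstRemoteReplica(List<Integer> replicasInPlanOrder, int coordinator) {
        for (int node : replicasInPlanOrder) {
            if (node != coordinator) {
                return node;
            }
        }
        throw new IllegalArgumentException("no remote replica in " + replicasInPlanOrder);
    }

    public static void main(String[] args) {
        // Nodes 2 and 3 start in parallel, so either registration order can occur.
        List<Integer> orderA = List.of(1, 2, 3);
        List<Integer> orderB = List.of(1, 3, 2);

        // Without pinning, the first remote replica depends on registration order.
        System.out.println(firstRemoteReplica(orderA, 1)); // prints 2
        System.out.println(firstRemoteReplica(orderB, 1)); // prints 3

        // With pinning, it is node 2 in both cases.
        System.out.println(firstRemoteReplica(pinOrder(orderA), 1)); // prints 2
        System.out.println(firstRemoteReplica(pinOrder(orderB), 1)); // prints 2
    }
}
```

In this model the test's assumption (node 2 is the first remote replica) holds only once the order is pinned. The second factor corresponds to any later step that reorders the pinned list, which is why dynamic snitch has to be disabled as well.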
Note: To simplify troubleshooting of Java dtests, I have added logging of the
actual configuration in Java dtests (the default server logic does not print it
because logging is initialized later in tests and the config is overridden).
> Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-20251
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20251
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments:
> TEST-org.apache.cassandra.distributed.test.ReadSpeculationTest.xml,
> system_node1.log, system_node2.log, system_node3.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/5285/workflows/20d3f23b-d9e5-4130-8c28-d87682f919de/jobs/329400/tests]
> {code:java}
> junit.framework.AssertionFailedError:
> Expecting actual:
> 6434477L
> to be greater than:
> 2000000000L
> at org.apache.cassandra.distributed.test.ReadSpeculationTest$TestScenario.assertWillSpeculate(ReadSpeculationTest.java:172)
> at org.apache.cassandra.distributed.test.ReadSpeculationTest.lambda$speculateTest$81c80a4a$2(ReadSpeculationTest.java:74)
> at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
> at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
> at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> Present in Butler as well:
> https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/org.apache.cassandra.distributed.test/ReadSpeculationTest/speculateTest