[
https://issues.apache.org/jira/browse/CASSANDRA-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927420#comment-17927420
]
Dmitry Konstantinov edited comment on CASSANDRA-20251 at 2/15/25 5:02 PM:
--------------------------------------------------------------------------
I have analyzed the logs for the failed run and think the root cause is the
following: an implicit assumption made in ReadSpeculationTest does not always
hold with TCM. The test expects that we always try to send a read request from
the 1st node to the 2nd and then (because we drop messages between the 1st and
2nd nodes) do a speculative retry to the 3rd node.
The order is defined by the ReplicaPlan, which is constructed from Snitch
information in pre-5.1 versions and from TCM information in 5.1 (if it is
enabled).
It looks like with TCM we may sometimes get a different order. Below is an
extract from the node system logs for the failed test run; here the NodeId
values are in a different order than the test instance numbers/IPs (node2
registered as NodeId 3 and node3 as NodeId 2):
{code:java}
INFO [node1_isolatedExecutor:1] node1 2025-01-27 10:44:56,612 Register.java:128 - Registered with endpoint /127.0.0.1:7012, node id: NodeId{id=1}
INFO [node2_isolatedExecutor:2] node2 2025-01-27 10:45:01,899 Register.java:128 - Registered with endpoint /127.0.0.2:7012, node id: NodeId{id=3}
INFO [node3_isolatedExecutor:2] node3 2025-01-27 10:45:01,683 Register.java:128 - Registered with endpoint /127.0.0.3:7012, node id: NodeId{id=2}
{code}
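To make the mismatch easier to spot in other failed runs, here is a minimal
sketch that scans registration lines like the ones above and reports whether
any instance registered with a NodeId different from its instance number. The
class name NodeIdOrderCheck, its methods, and the regular expression are
hypothetical helpers written for this illustration; they are not part of
Cassandra or of the test, and the pattern assumes each log entry sits on a
single line in the format shown in the extract.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cassandra: compares dtest instance numbers
// with the NodeId each instance reports in its "Registered with endpoint" line.
class NodeIdOrderCheck {
    // Captures the instance number (e.g. the 2 in "node2") and the registered NodeId.
    // Assumption: the line format matches the Register.java:128 entries quoted above.
    private static final Pattern REGISTERED = Pattern.compile(
        "\\bnode(\\d+) .*Registered with endpoint \\S+, node id: NodeId\\{id=(\\d+)\\}");

    /** Returns instance number -> NodeId for every registration line found. */
    static Map<Integer, Integer> parse(List<String> logLines) {
        Map<Integer, Integer> ids = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = REGISTERED.matcher(line);
            if (m.find()) {
                ids.put(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
            }
        }
        return ids;
    }

    /** True if any instance registered with a NodeId other than its instance number. */
    static boolean hasMismatch(Map<Integer, Integer> ids) {
        return ids.entrySet().stream()
                  .anyMatch(e -> !e.getKey().equals(e.getValue()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "INFO [node1_isolatedExecutor:1] node1 2025-01-27 10:44:56,612 Register.java:128 - Registered with endpoint /127.0.0.1:7012, node id: NodeId{id=1}",
            "INFO [node2_isolatedExecutor:2] node2 2025-01-27 10:45:01,899 Register.java:128 - Registered with endpoint /127.0.0.2:7012, node id: NodeId{id=3}",
            "INFO [node3_isolatedExecutor:2] node3 2025-01-27 10:45:01,683 Register.java:128 - Registered with endpoint /127.0.0.3:7012, node id: NodeId{id=2}");
        Map<Integer, Integer> ids = parse(lines);
        System.out.println("instance -> NodeId: " + ids);
        System.out.println("mismatch: " + hasMismatch(ids));
    }
}
```

For the three lines from the failed run this yields the mapping 1 -> 1,
2 -> 3, 3 -> 2 and reports a mismatch, which is the situation in which the
test's assumption about the contact order may not hold.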
> Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-20251
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20251
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments:
> TEST-org.apache.cassandra.distributed.test.ReadSpeculationTest.xml,
> system_node1.log, system_node2.log, system_node3.log
>
>
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/5285/workflows/20d3f23b-d9e5-4130-8c28-d87682f919de/jobs/329400/tests]
> {code:java}
> junit.framework.AssertionFailedError:
> Expecting actual:
> 6434477L
> to be greater than:
> 2000000000L
> at org.apache.cassandra.distributed.test.ReadSpeculationTest$TestScenario.assertWillSpeculate(ReadSpeculationTest.java:172)
> at org.apache.cassandra.distributed.test.ReadSpeculationTest.lambda$speculateTest$81c80a4a$2(ReadSpeculationTest.java:74)
> at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
> at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
> at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> Present in Butler as well:
> https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/org.apache.cassandra.distributed.test/ReadSpeculationTest/speculateTest
--
This message was sent by Atlassian Jira
(v8.20.10#820010)