[jira] [Comment Edited] (CASSANDRA-20251) Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest

Dmitry Konstantinov (Jira) Sat, 15 Feb 2025 09:13:14 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927420#comment-17927420
 ]


Dmitry Konstantinov edited comment on CASSANDRA-20251 at 2/15/25 5:02 PM:
--------------------------------------------------------------------------

I have analyzed the logs for the failed run and think the root cause is the 
following: an implicit assumption made in ReadSpeculationTest does not always 
hold with TCM. The test expects that we always try to send the read request 
from the 1st node to the 2nd node first and then (because we drop messages 
between the 1st and 2nd nodes) do a speculative retry to the 3rd node.

The order is defined by the ReplicaPlan, which is constructed from Snitch 
information in pre-5.1 versions and from TCM information in 5.1 (if it is 
enabled).

It looks like with TCM we may sometimes get a different order. Below is an 
extract from the node system logs for the failed test run; here the NodeId 
values are assigned in a different order than the test instance numbers/IPs:
{code:java}
INFO  [node1_isolatedExecutor:1] node1 2025-01-27 10:44:56,612 Register.java:128 - Registered with endpoint /127.0.0.1:7012, node id: NodeId{id=1}
INFO  [node2_isolatedExecutor:2] node2 2025-01-27 10:45:01,899 Register.java:128 - Registered with endpoint /127.0.0.2:7012, node id: NodeId{id=3}
INFO  [node3_isolatedExecutor:2] node3 2025-01-27 10:45:01,683 Register.java:128 - Registered with endpoint /127.0.0.3:7012, node id: NodeId{id=2}
{code}
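
To make the effect of the swapped ids concrete, here is a toy model of the scenario. It is not Cassandra code: the class ReplicaOrderSketch and its helpers are hypothetical, and it assumes purely for illustration that the remote replicas are contacted in ascending order of whichever key the plan is built from (instance number vs. NodeId). The instance numbers, endpoints and NodeId values are the ones from the log extract above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: illustrates how a different ordering key changes
// which replica node1 contacts first. Not the real ReplicaPlan logic.
class ReplicaOrderSketch {
    // Hypothetical stand-in for a replica as seen by the test.
    static final class Replica {
        final int instanceNum;   // dtest instance number (node2 -> 2)
        final String endpoint;   // address from the log extract
        final int nodeId;        // NodeId assigned at registration

        Replica(int instanceNum, String endpoint, int nodeId) {
            this.instanceNum = instanceNum;
            this.endpoint = endpoint;
            this.nodeId = nodeId;
        }
    }

    // Remote replicas from the failed run; node1 is the coordinator.
    static final List<Replica> REMOTE_REPLICAS = Arrays.asList(
        new Replica(2, "127.0.0.2:7012", 3),
        new Replica(3, "127.0.0.3:7012", 2));

    // Assumed orderings, for illustration only.
    static final Comparator<Replica> BY_INSTANCE_NUM =
        Comparator.comparingInt(r -> r.instanceNum);
    static final Comparator<Replica> BY_NODE_ID =
        Comparator.comparingInt(r -> r.nodeId);

    // The replica that would receive the initial read request.
    static Replica firstContacted(Comparator<Replica> order) {
        List<Replica> plan = new ArrayList<>(REMOTE_REPLICAS);
        plan.sort(order);
        return plan.get(0);
    }

    // A speculative retry is only needed when the first contacted
    // replica is the one whose messages are dropped.
    static boolean wouldSpeculate(Comparator<Replica> order, int droppedInstanceNum) {
        return firstContacted(order).instanceNum == droppedInstanceNum;
    }

    public static void main(String[] args) {
        // The test drops messages between node1 and node2.
        System.out.println("by instance num: " + wouldSpeculate(BY_INSTANCE_NUM, 2)); // true
        System.out.println("by node id: " + wouldSpeculate(BY_NODE_ID, 2));           // false
    }
}
```

Under these assumptions, ordering by NodeId sends the first request to node3, the read completes without a retry, and the test's expectation that speculation happens is not met.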


was (Author: dnk):
I have analyzed the logs for the failed run and think that the root cause is 
that an implicit assumption made in ReadSpeculationTest not always true for 
TCM: the test expects that we always try to send read request from 1st node to 
2nd and then (because we drop messages between 1s and 2nd nodes) we will do a 
speculative retry to 3rd node.

The order is defined by ReplicaPlan constructed based on Snitch information in 
pre-5.1 versions and using TCM information in 5.1 (if it is enabled).

It looks like with TCM we may get a different order sometimes, below is an 
extract from node system logs for the failed run, we have NodeId with a 
different order compared to test instances num/IPs:
{code:java}
INFO  [node1_isolatedExecutor:1] node1 2025-01-27 10:44:56,612 
Register.java:128 - Registered with endpoint /127.0.0.1:7012, node id: 
NodeId{id=1}
INFO  [node2_isolatedExecutor:2] node2 2025-01-27 10:45:01,899 
Register.java:128 - Registered with endpoint /127.0.0.2:7012, node id: 
NodeId{id=3}
INFO  [node3_isolatedExecutor:2] node3 2025-01-27 10:45:01,683 
Register.java:128 - Registered with endpoint /127.0.0.3:7012, node id: 
NodeId{id=2} {code}

> Flaky test - org.apache.cassandra.distributed.test.ReadSpeculationTest
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-20251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20251
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 
> TEST-org.apache.cassandra.distributed.test.ReadSpeculationTest.xml, 
> system_node1.log, system_node2.log, system_node3.log
>
>
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/5285/workflows/20d3f23b-d9e5-4130-8c28-d87682f919de/jobs/329400/tests]
> {code:java}
> junit.framework.AssertionFailedError: 
> Expecting actual:
>   6434477L
> to be greater than:
>   2000000000L
>       at org.apache.cassandra.distributed.test.ReadSpeculationTest$TestScenario.assertWillSpeculate(ReadSpeculationTest.java:172)
>       at org.apache.cassandra.distributed.test.ReadSpeculationTest.lambda$speculateTest$81c80a4a$2(ReadSpeculationTest.java:74)
>       at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> Present in Butler as well: 
> https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/org.apache.cassandra.distributed.test/ReadSpeculationTest/speculateTest



