[
https://issues.apache.org/jira/browse/CASSANDRA-17461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581928#comment-17581928
]
Benedict Elliott Smith edited comment on CASSANDRA-17461 at 8/19/22 3:46 PM:
-----------------------------------------------------------------------------
bq. in node1's view, node4 has been decommissioned/assassinated and is in the LEFT state
No, in node1's view node4 has never joined the ring, because we invoke {{unsafeAnnulEndpoint}} after updating the token information via the LEFT state. This is just a convenient way to simulate node4 joining without node1 receiving the gossip state.
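For reference, a minimal sketch of that simulation using the in-JVM dtest API, inside a test method with {{cluster}} in scope. Only {{unsafeAnnulEndpoint}} is taken from the comment above; the surrounding calls, the address, and the helper sequence are illustrative assumptions, not the actual CASTest code.
{code:java}
import java.net.UnknownHostException;

import org.apache.cassandra.gms.Gossiper;
import org.apache.cassandra.locator.InetAddressAndPort;

// The test first applies node4's LEFT state (removing its token ownership),
// then annuls the endpoint so no gossip record of node4 remains; node1 now
// behaves as if node4 had never joined the ring.
cluster.get(1).runOnInstance(() -> {
    try
    {
        InetAddressAndPort node4 = InetAddressAndPort.getByName("127.0.0.4"); // address illustrative
        Gossiper.instance.unsafeAnnulEndpoint(node4);
    }
    catch (UnknownHostException e)
    {
        throw new AssertionError(e);
    }
});
{code}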
bq. any gossip exchange between node1 and node4 will cause the state change back to the correct state of NORMAL ... and in the real world, may not get so lucky in timing to be from one
You have it backwards. Ordinary operations _depend_ on gossip disseminating this promptly for correctness. We are simulating it not yet having happened, leaving an inconsistent view of the token ring that permits operations to reach consensus without overlapping quorums, due to token ownership disagreements. Paxos operations (which must be linearizable) are now able to detect this inconsistency themselves and correct it, by updating the gossip state directly.
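To make the quorum-overlap point concrete, here is a self-contained illustration (plain Java, no Cassandra APIs; all names hypothetical) of how two coordinators with divergent ring views can each assemble a majority that never intersects the other's:
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DisjointQuorums
{
    public static void main(String[] args)
    {
        // node1's stale view: node4 never joined, so the range is owned by {n1, n2, n3};
        // a majority (2 of 3) can be satisfied by {n1, n2}.
        Set<String> staleQuorum = new HashSet<>(List.of("n1", "n2"));
        // node4's current view: after joining, the range is owned by {n2, n3, n4};
        // its majority can be satisfied by {n3, n4}.
        Set<String> freshQuorum = new HashSet<>(List.of("n3", "n4"));

        staleQuorum.retainAll(freshQuorum);
        // Prints an empty set: both "majorities" can accept conflicting proposals
        // without either seeing the other, which is exactly the linearizability
        // violation the Paxos ownership checks must detect.
        System.out.println("quorum overlap = " + staleQuorum);
    }
}
{code}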
Note that the {{Timeout}} is also a _valid outcome_ for this test, just a poor one for validating that everything else works correctly. The {{Timeout}} occurs because we explicitly drop messages to one of the _correct_ owners, to ensure we can only succeed by contacting the _new_ owner; if this race occurs we drop that message too, leaving only one correct owner able to respond.
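The drop described above would look roughly like this with the dtest message filters (again with {{cluster}} in scope); the verb and node numbers are illustrative assumptions rather than the exact CASTest configuration:
{code:java}
import org.apache.cassandra.net.Verb;

// Drop Paxos prepare requests from the coordinator (node1) to one of the two
// *correct* owners, so the round can only reach QUORUM if the *new* owner
// responds. If the gossip race strikes and that message is dropped as well,
// only one correct owner remains: hence "received 1 of 2 required responses".
cluster.filters()
       .verbs(Verb.PAXOS_PREPARE_REQ.id)
       .from(1)
       .to(2)
       .drop();
{code}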
bq. it looks like the other 'stale ring' tests are explicitly adding node4 back, only this one does not.
Yes, they are testing other things.
> Test Failure: org.apache.cassandra.distributed.test.CASTest.testConflictingWritesWithStaleRingInformation
> ---------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17461
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Andres de la Peña
> Priority: Normal
> Fix For: 4.1-beta, 4.x
>
>
> Intermittent failures on {{org.apache.cassandra.distributed.test.CASTest}} for trunk:
> * [testConflictingWritesWithStaleRingInformation|https://ci-cassandra.apache.org/job/Cassandra-trunk/1024/testReport/org.apache.cassandra.distributed.test/CASTest/testConflictingWritesWithStaleRingInformation_3/]
> * [testSuccessfulWriteBeforeRangeMovement|https://ci-cassandra.apache.org/job/Cassandra-trunk/1025/testReport/org.apache.cassandra.distributed.test/CASTest/testSuccessfulWriteBeforeRangeMovement/]
> * [testSuccessfulWriteDuringRangeMovementFollowedByConflicting|https://ci-cassandra.apache.org/job/Cassandra-trunk/1020/testReport/org.apache.cassandra.distributed.test/CASTest/testSuccessfulWriteDuringRangeMovementFollowedByConflicting/]
> * [testSucccessfulWriteDuringRangeMovementFollowedByRead|https://ci-cassandra.apache.org/job/Cassandra-trunk/1020/testReport/org.apache.cassandra.distributed.test/CASTest/testSucccessfulWriteDuringRangeMovementFollowedByRead/]
> All four failures show the same symptom:
> {code}
> Failed 2 times in the last 5 runs. Flakiness: 50%, Stability: 60%
> Error Message
> CAS operation timed out: received 1 of 2 required responses after 0 contention retries
> Stacktrace
> org.apache.cassandra.exceptions.CasWriteTimeoutException: CAS operation timed out: received 1 of 2 required responses after 0 contention retries
>   at org.apache.cassandra.service.paxos.Paxos$MaybeFailure.markAndThrowAsTimeoutOrFailure(Paxos.java:547)
>   at org.apache.cassandra.service.paxos.Paxos.begin(Paxos.java:1048)
>   at org.apache.cassandra.service.paxos.Paxos.cas(Paxos.java:659)
>   at org.apache.cassandra.service.paxos.Paxos.cas(Paxos.java:618)
>   at org.apache.cassandra.service.StorageProxy.cas(StorageProxy.java:307)
>   at org.apache.cassandra.cql3.statements.ModificationStatement.executeWithCondition(ModificationStatement.java:500)
>   at org.apache.cassandra.cql3.statements.ModificationStatement.execute(ModificationStatement.java:467)
>   at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:122)
>   at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:103)
>   at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:66)
>   at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>   at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Standard Output
> DEBUG [main] 2022-03-19 16:20:42,868 Reflections.java:198 - going to scan these urls: [jar:file:/home/cassandra/cassandra/build/apache-cassandra-4.1-SNAPSHOT.jar!/, jar:file:/home/cassandra/cassandra/build/test/lib/jars/simulator-bootstrap.jar!/, jar:file:/home/cassandra/cassandra/build/test/lib/jars/dtest-api-0.0.12.jar!/, file:/home/cassandra/cassandra/build/classes/fqltool/, file:/home/cassandra/cassandra/build/test/classes/, file:/home/cassandra/cassandra/build/classes/main/, file:/home/cass ...[truncated 4929659 chars]... gService.java:519 - Waiting for messaging service to quiesce
> INFO [node1_isolatedExecutor:10] 2022-03-19 16:21:55,223 SubstituteLogger.java:169 - INFO [node1_isolatedExecutor:10] node1 2022-03-19 16:21:55,221 MessagingService.java:519 - Waiting for messaging service to quiesce
> INFO [node2_isolatedExecutor:8] 2022-03-19 16:21:55,223 SubstituteLogger.java:169 - INFO [node2_isolatedExecutor:8] node2 2022-03-19 16:21:55,222 MessagingService.java:519 - Waiting for messaging service to quiesce
> {code}
> Failures can also be repeatedly hit with the CircleCI test multiplexer: [https://app.circleci.com/pipelines/github/adelapena/cassandra/1394/workflows/8d40d44b-7ccb-40fe-82d5-37db0bb228a3].
> The same test looks ok in 4.0, as suggested by Butler and [this repeated Circle run|https://app.circleci.com/pipelines/github/adelapena/cassandra/1395/workflows/5669dd1e-1a4c-4801-b1a1-c3ca04a29e2b].