[ 
https://issues.apache.org/jira/browse/CASSANDRA-17461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578494#comment-17578494
 ] 

Benedict Elliott Smith commented on CASSANDRA-17461:
----------------------------------------------------

Thanks Ekaterina, that makes clear what the problem is. The issue is that when 
node1 receives an update to its ring information about node4, it first contacts 
node4 to confirm it is alive before marking it so, and this is racing with the 
submission of the second round of Paxos Prepares so that the outgoing message 
to node4 is being dropped before it hits the wire.

I have introduced some additional logic to ensure that node4 is marked alive 
before the second round of prepares is submitted

https://github.com/belliottsmith/cassandra/tree/17461-4.1

> Test Failure: 
> org.apache.cassandra.distributed.test.CASTest.testConflictingWritesWithStaleRingInformation
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17461
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17461
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Andres de la Peña
>            Priority: Normal
>             Fix For: 4.1-beta, 4.x
>
>
> Intermittent failures on {{org.apache.cassandra.distributed.test.CASTest}} 
> for trunk:
> * 
> [testConflictingWritesWithStaleRingInformation|https://ci-cassandra.apache.org/job/Cassandra-trunk/1024/testReport/org.apache.cassandra.distributed.test/CASTest/testConflictingWritesWithStaleRingInformation_3/]
> * 
> [testSuccessfulWriteBeforeRangeMovement|https://ci-cassandra.apache.org/job/Cassandra-trunk/1025/testReport/org.apache.cassandra.distributed.test/CASTest/testSuccessfulWriteBeforeRangeMovement/]
> * 
> [testSuccessfulWriteDuringRangeMovementFollowedByConflicting|https://ci-cassandra.apache.org/job/Cassandra-trunk/1020/testReport/org.apache.cassandra.distributed.test/CASTest/testSuccessfulWriteDuringRangeMovementFollowedByConflicting/]
> * 
> [testSucccessfulWriteDuringRangeMovementFollowedByRead|https://ci-cassandra.apache.org/job/Cassandra-trunk/1020/testReport/org.apache.cassandra.distributed.test/CASTest/testSucccessfulWriteDuringRangeMovementFollowedByRead/]
> All four seem to have the same aspect:
> {code}
> Failed 2 times in the last 5 runs. Flakiness: 50%, Stability: 60%
> Error Message
> CAS operation timed out: received 1 of 2 required responses after 0 
> contention retries
> Stacktrace
> org.apache.cassandra.exceptions.CasWriteTimeoutException: CAS operation timed 
> out: received 1 of 2 required responses after 0 contention retries
>       at 
> org.apache.cassandra.service.paxos.Paxos$MaybeFailure.markAndThrowAsTimeoutOrFailure(Paxos.java:547)
>       at org.apache.cassandra.service.paxos.Paxos.begin(Paxos.java:1048)
>       at org.apache.cassandra.service.paxos.Paxos.cas(Paxos.java:659)
>       at org.apache.cassandra.service.paxos.Paxos.cas(Paxos.java:618)
>       at org.apache.cassandra.service.StorageProxy.cas(StorageProxy.java:307)
>       at 
> org.apache.cassandra.cql3.statements.ModificationStatement.executeWithCondition(ModificationStatement.java:500)
>       at 
> org.apache.cassandra.cql3.statements.ModificationStatement.execute(ModificationStatement.java:467)
>       at 
> org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:122)
>       at 
> org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:103)
>       at 
> org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:66)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Standard Output
> DEBUG [main] 2022-03-19 16:20:42,868 Reflections.java:198 - going to scan 
> these urls: 
> [jar:file:/home/cassandra/cassandra/build/apache-cassandra-4.1-SNAPSHOT.jar!/,
>  
> jar:file:/home/cassandra/cassandra/build/test/lib/jars/simulator-bootstrap.jar!/,
>  
> jar:file:/home/cassandra/cassandra/build/test/lib/jars/dtest-api-0.0.12.jar!/,
>  file:/home/cassandra/cassandra/build/classes/fqltool/, 
> file:/home/cassandra/cassandra/build/test/classes/, 
> file:/home/cassandra/cassandra/build/classes/main/, file:/home/cass
> ...[truncated 4929659 chars]...
> gService.java:519 - Waiting for messaging service to quiesce
> INFO  [node1_isolatedExecutor:10] 2022-03-19 16:21:55,223 
> SubstituteLogger.java:169 - INFO  [node1_isolatedExecutor:10] node1 
> 2022-03-19 16:21:55,221 MessagingService.java:519 - Waiting for messaging 
> service to quiesce
> INFO  [node2_isolatedExecutor:8] 2022-03-19 16:21:55,223 
> SubstituteLogger.java:169 - INFO  [node2_isolatedExecutor:8] node2 2022-03-19 
> 16:21:55,222 MessagingService.java:519 - Waiting for messaging service to 
> quiesce
> {code}
> Failures can also be repeatedly hit with CircleCI test multiplexer:
> [https://app.circleci.com/pipelines/github/adelapena/cassandra/1394/workflows/8d40d44b-7ccb-40fe-82d5-37db0bb228a3].
> The same test looks ok in 4.0, as suggested by Butler and [this repeated 
> Circle 
> run|https://app.circleci.com/pipelines/github/adelapena/cassandra/1395/workflows/5669dd1e-1a4c-4801-b1a1-c3ca04a29e2b].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to