David Capwell created CASSANDRA-18422:
-----------------------------------------

             Summary: CEP-15 (Accord) Original and recover coordinators may hit 
a race condition with PreApply where reads and writes are interleaved, causing 
one of the coordinators to see the writes from the other
                 Key: CASSANDRA-18422
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18422
             Project: Cassandra
          Issue Type: Bug
          Components: Accord
            Reporter: David Capwell
            Assignee: David Capwell


While verifying CASSANDRA-18364 I saw the following history violation in 
simulator

{code}
[junit-timeout] Testcase: 
simulationTest(org.apache.cassandra.simulator.test.ShortAccordSimulationTest)-.jdk1.8:
        Caused an ERROR
[junit-timeout] Failed on seed 0xadaca81151490353
[junit-timeout] org.apache.cassandra.simulator.SimulationException: Failed on 
seed 0xadaca81151490353
[junit-timeout] Caused by: java.lang.AssertionError: History violations detected
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.PaxosSimulation.logAndThrow(PaxosSimulation.java:315)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.PaxosSimulation.isDone(PaxosSimulation.java:278)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.PaxosSimulation$2.hasNext(PaxosSimulation.java:249)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:224)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.AbstractPairOfSequencesPaxosSimulation.run(AbstractPairOfSequencesPaxosSimulation.java:297)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation.run(PairOfSequencesAccordSimulation.java:62)
[junit-timeout]         at 
org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:374)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:39)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:30)
[junit-timeout]         at 
org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:355)
[junit-timeout]         at 
org.apache.cassandra.simulator.paxos.AccordSimulationRunner.main(AccordSimulationRunner.java:76)
[junit-timeout]         at 
org.apache.cassandra.simulator.test.ShortAccordSimulationTest.simulationTest(ShortAccordSimulationTest.java:32)
[junit-timeout]         Suppressed: 
org.apache.cassandra.simulator.paxos.HistoryViolation: Inconsistent sequences 
on 1: [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 
34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, 68, 70, 71, 74, 76, 79, 
80, 83, 85, 87, 87] vs [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, 26, 
28, 29, 31, 32, 34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, 68, 70, 
71, 74, 76, 79, 80, 83, 85, 87]+90
[junit-timeout]                 at 
accord.verify.StrictSerializabilityVerifier$Register.updateSequence(StrictSerializabilityVerifier.java:607)
[junit-timeout]                 at 
accord.verify.StrictSerializabilityVerifier$Register.access$100(StrictSerializabilityVerifier.java:576)
[junit-timeout]                 at 
accord.verify.StrictSerializabilityVerifier.apply(StrictSerializabilityVerifier.java:825)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.lambda$close$0(StrictSerializabilityValidator.java:66)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.convertHistoryViolation(StrictSerializabilityValidator.java:89)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.access$200(StrictSerializabilityValidator.java:27)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.close(StrictSerializabilityValidator.java:66)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.LoggingHistoryValidator$1.close(LoggingHistoryValidator.java:63)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.verify(PairOfSequencesAccordSimulation.java:218)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:135)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.accept(PairOfSequencesAccordSimulation.java:171)
[junit-timeout]                 at 
org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:83)
[junit-timeout]                 at 
org.apache.cassandra.simulator.systems.SimulatedActionCallable$1.run(SimulatedActionCallable.java:47)
[junit-timeout]                 at 
org.apache.cassandra.simulator.systems.InterceptingExecutor$InterceptingPooledExecutor$WaitingThread.lambda$new$1(InterceptingExecutor.java:317)
[junit-timeout]                 at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]                 at java.lang.Thread.run(Thread.java:750)
{code}

Adding logging to track message passing, reads, and writes, I have the 
following ordering

{code}
[isolatedExecutor:3]  node3 2023-04-03 12:54:30,200 send(/127.0.0.1:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ))
[isolatedExecutor:3]  node3 2023-04-03 12:54:30,200 send(/127.0.0.2:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ))
[isolatedExecutor:3]  node3 2023-04-03 12:54:30,200 send(/127.0.0.3:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ))
[CommandStore[2]:1]   node1 2023-04-03 12:54:30,208 CS:[2] OP:0xea64a268 
reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE 
verb:ACCORD_PREACCEPT_RSP))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,209 CS:[2] OP:0x9761fb36 
reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE 
verb:ACCORD_PREACCEPT_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,210 CS:[2] OP:0xe62230f2 
reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE 
verb:ACCORD_PREACCEPT_RSP))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,211 CS:[2] OP:0xc5563e5d 
send(/127.0.0.1:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d 
send(/127.0.0.2:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d 
send(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 
send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 
send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 
send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,334 CS:[2] OP:0xf8562cfb 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE 
verb:ACCORD_RECOVER_RSP))
[CommandStore[2]:1]   node1 2023-04-03 12:54:30,338 CS:[2] OP:0xcfd2540f 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE 
verb:ACCORD_RECOVER_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,338 CS:[2] OP:0xc4cf5af8 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_RECOVER_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE 
verb:ACCORD_RECOVER_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 
send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 
send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 
send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,343 CS:[2] OP:0xb60153ab 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE 
verb:ACCORD_GET_DEPS_RSP))
[CommandStore[2]:1]   node1 2023-04-03 12:54:30,344 CS:[2] OP:0xac20a1d6 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE 
verb:ACCORD_GET_DEPS_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,345 CS:[2] OP:0xa73f7484 
reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE 
verb:ACCORD_GET_DEPS_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a 
send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a 
send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a 
send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE 
verb:ACCORD_COMMIT_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 
Performing read
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 
Performing read
[ReadStage:1]         node3 2023-04-03 12:54:30,351 Performing read; post
[ReadStage:1]         node3 2023-04-03 12:54:30,351 Performing read; post
[ReadStage:1]         node3 2023-04-03 12:54:30,351 Performing read; pre
[ReadStage:1]         node3 2023-04-03 12:54:30,351 Performing read; pre
[ReadStage:1]         node3 2023-04-03 12:54:30,351 reply(/127.0.0.2:7012, 
(from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), 
(from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,359 Performing coordinated write
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,359 send(/127.0.0.1:7012, 
(from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,359 send(/127.0.0.2:7012, 
(from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,360 send(/127.0.0.3:7012, 
(from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 
Performing read
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 
Performing read
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,364 CS:[2] OP:0x92e94460 
Performing write
[MutationStage:4]     node3 2023-04-03 12:54:30,364 Performing write: pre
[MutationStage:4]     node3 2023-04-03 12:54:30,365 Performing write: post
[ReadStage:1]         node3 2023-04-03 12:54:30,365 Performing read; post
[ReadStage:1]         node3 2023-04-03 12:54:30,365 Performing read; pre
[ReadStage:1]         node3 2023-04-03 12:54:30,369 Performing read; post
[ReadStage:1]         node3 2023-04-03 12:54:30,369 Performing read; pre
[ReadStage:1]         node3 2023-04-03 12:54:30,369 reply(/127.0.0.3:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), 
(from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP))
[CommandStore[2]:1]   node1 2023-04-03 12:54:30,370 CS:[2] OP:0xa59dc286 
Performing write
[CommandStore[2]:1]   node2 2023-04-03 12:54:30,374 CS:[2] OP:0xab0f3ca4 
Performing write
[MutationStage:1]     node2 2023-04-03 12:54:30,374 Performing write: pre
[MutationStage:1]     node2 2023-04-03 12:54:30,375 Performing write: post
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,376 Performing coordinated write
[MutationStage:3]     node1 2023-04-03 12:54:30,376 Performing write: pre
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,377 send(/127.0.0.1:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,377 send(/127.0.0.2:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,377 send(/127.0.0.3:7012, 
(from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ))
[CommandStore[2]:1]   node3 2023-04-03 12:54:30,382 reply(/127.0.0.2:7012, 
(from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ), 
(from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_APPLY_RSP))
[MutationStage:3]     node1 2023-04-03 12:54:30,382 Performing write: post
{code}

(The transaction has a returning select and an auto-read, which is why there 
are double logs for reads)

Here we see the following timing

{code}
T00 node3 starts txn
T01 node3 sends COMMIT
T02 node2 starts recover
T03 all nodes ack to the pending recover
T04 node2 sends COMMIT
T05 node3 performs reads needed for txn
T06 node3 sends read results to node2
T07 node2 performs write locally and send APPLY
T08 node3 performs write
T09 node3 performs reads needed for txn
T10 node3 send reads to node3
T11 node3 performs write and sends APPLY
T12 node3 ACKs APPLY to node2
{code}

Given the fact the simulator got a response back and didn’t get a preempt, this 
implies that the original coordinator was able to complete the full transaction 
without issues and reply back, but the reads/writes were interleaved between 
node3 and node2 causing the second write to observe the first write



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to