[
https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347327#comment-15347327
]
sankalp kohli commented on CASSANDRA-12043:
-------------------------------------------
Please commit this in 2.1.15+
> Syncing most recent commit in CAS across replicas can cause all CAS queries
> in the CQL partition to fail
> --------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-12043
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12043
> Project: Cassandra
> Issue Type: Bug
> Reporter: sankalp kohli
> Assignee: Sylvain Lebresne
>
> We update the most recent commit on the requiredParticipant replicas if they
> are out of sync during the prepare round in the beginAndRepairPaxos method.
> We keep doing this in a loop until the requiredParticipant replicas have the
> same most recent commit or we hit the timeout.
> Say we have 3 machines A, B and C, and gc_grace_seconds on the table is 10
> days. We do a CAS write at time 0 and it goes to A and B but not to C. C will
> get the hint later but will not update the most recent commit in the paxos
> table; this is how CAS hints work.
> In the paxos table, whose gc_grace_seconds=0, most_recent_commit on A and B
> will be inserted with timestamp 0 and a TTL of 10 days. After 10 days this
> insert becomes a tombstone (still at timestamp 0) until it is compacted away,
> since gc_grace_seconds=0.
> Do a CAS read after, say, 1 day on the same CQL partition, and this time the
> prepare phase involves A and C. most_recent_commit on C for this CQL
> partition is empty, so A sends its most_recent_commit to C with a timestamp
> of 0 and a TTL of 10 days. This most_recent_commit on C will expire on the
> 11th day since it was inserted after 1 day.
> most_recent_commit is now in sync on A, B and C; however, on A and B it will
> expire on the 10th day, whereas on C it will expire on the 11th day since it
> was inserted one day later.
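> To make the drift explicit, here is the same timeline as toy arithmetic
> (illustration only, not Cassandra code):
>
>     // Expiry drift between replicas, using the numbers from the example above.
>     int ttlDays = 10;
>     int commitDay = 0;                        // original CAS commit lands on A and B
>     int repairDay = 1;                        // commit is replayed to C during a later prepare
>     int expiryOnAandB = commitDay + ttlDays;  // day 10: becomes a tombstone on A and B
>     int expiryOnC = repairDay + ttlDays;      // day 11: still live on C until then
>     // Between day 10 and day 11, any prepare that pairs C with A or B sees a
>     // mismatch (tombstone vs. live commit at the same timestamp) and re-replays.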
> Do another CAS read after 10 days, when most_recent_commit on A and B has
> expired and is treated as a tombstone until compacted. In this CAS read, say
> A and C are involved in the prepare phase. most_recent_commit will not match
> between them since it has expired on A but is still live on C. This causes
> most_recent_commit to be applied to A with a timestamp of 0 and a TTL of
> 10 days. If A has not yet compacted away the original, expired
> most_recent_commit, this new write to most_recent_commit won't be visible on
> reads, since there is a tombstone with the same timestamp (a delete wins over
> data with the same timestamp).
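> That parenthetical is the usual Cassandra reconciliation rule; a minimal,
> self-contained illustration with toy types (not the actual Cell/reconcile
> code):
>
>     // Toy model of cell reconciliation: the higher timestamp wins; on a tie the
>     // tombstone wins. This is why the replayed most_recent_commit, written with
>     // the same timestamp 0 as the expired original, never becomes visible on A.
>     final class ToyCell
>     {
>         final long timestamp;
>         final boolean isTombstone;
>
>         ToyCell(long timestamp, boolean isTombstone)
>         {
>             this.timestamp = timestamp;
>             this.isTombstone = isTombstone;
>         }
>
>         static ToyCell reconcile(ToyCell a, ToyCell b)
>         {
>             if (a.timestamp != b.timestamp)
>                 return a.timestamp > b.timestamp ? a : b;
>             return a.isTombstone ? a : b; // equal timestamps: the delete shadows the data
>         }
>     }
>
> With the expired original as a tombstone at timestamp 0 and the repaired
> commit as live data at timestamp 0, reconcile() always returns the tombstone.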
> Another round of prepare will follow and again A will say it does not know
> about the most_recent_commit (the repaired write is covered by the original
> write, which is now a tombstone), and C will again try to send the write to
> A. This can keep going until the request times out or until only A and B are
> involved in the prepare phase.
> When A’s original most_recent_commit, which is now a tombstone, is compacted
> away, all the inserts it was covering become live again. These will in turn
> get replayed to another replica. This ping-pong can keep going on for a long
> time.
> The issue is that most_recent_commit expires at different times across
> replicas. When it gets replayed to a replica to bring it back in sync, we
> again set the TTL from that point onwards.
> During the CAS read which timed out, most_recent_commit was being sent to
> another replica in a loop. Even successful requests may loop a couple of
> times while A and C are involved and only succeed once the replicas that
> respond are A and B, so this has an impact on latencies as well.
> These timeouts get worse when a machine is down, as no progress can be made:
> the machine with the unexpired commit is always involved in the CAS prepare
> round. Also, with range movements, the new machine gaining the range has an
> empty most recent commit and receives the commit at a later time, causing the
> same issue.
> Repro steps:
> 1. Paxos TTL is max(3 hours, gc_grace) as defined in
> SystemKeyspace.paxosTtl(). Change this method so that it does not impose a
> minimum TTL of 3 hours: SystemKeyspace.paxosTtl() will look like
> return metadata.getGcGraceSeconds(); instead of
> return Math.max(3 * 3600, metadata.getGcGraceSeconds());
> We are doing this so that we don't need to wait for 3 hours.
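> The patched method would look roughly like this (sketch only, assuming the
> 2.1-era signature that takes the table's CFMetaData):
>
>     public static int paxosTtl(CFMetaData metadata)
>     {
>         // Repro-only change: drop the 3-hour floor so the paxos TTL equals the
>         // table's gc_grace_seconds (120 seconds below) and expiry is quick to
>         // observe.
>         return metadata.getGcGraceSeconds();
>         // was: return Math.max(3 * 3600, metadata.getGcGraceSeconds());
>     }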
> Create a 3 node cluster, with the code change suggested above, on machines
> A, B and C.
> CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy',
> 'replication_factor' : 3 };
> use test;
> CREATE TABLE users (user_name int PRIMARY KEY, password int);
> alter table users WITH gc_grace_seconds=120;
> consistency QUORUM;
> Bring down machine C
> INSERT INTO users (user_name, password) VALUES (1, 1) IF NOT EXISTS;
> nodetool flush on machines A and B
> Bring down machine B
> consistency SERIAL;
> tracing on;
> Wait 80 seconds
> Bring up machine C
> select * from users where user_name = 1;
> Wait 40 seconds
> select * from users where user_name = 1; // All queries from this point
> forward will time out.
> One potential fix could be to set the TTL based on the remaining time left on
> the other replicas: the new TTL would be the original TTL minus the age of
> the write (current time minus the write's timestamp), where the timestamp is
> calculated from the ballot, which uses server time.
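> A minimal sketch of that idea (hypothetical helper, not an actual patch; the
> real change would go wherever most_recent_commit is replayed to an
> out-of-date replica):
>
>     import java.util.concurrent.TimeUnit;
>
>     final class PaxosTtlHelper
>     {
>         // Hypothetical: TTL to use when replaying most_recent_commit to a lagging
>         // replica, so it expires at roughly the same moment as on the replicas
>         // that already have it, instead of getting a fresh full TTL.
>         static int remainingPaxosTtlSeconds(int paxosTtlSeconds, long writeTimestampMicros, long nowMicros)
>         {
>             long elapsedSeconds = TimeUnit.MICROSECONDS.toSeconds(nowMicros - writeTimestampMicros);
>             // Clamp at 1 second so the replayed commit still expires even if clocks
>             // disagree slightly or the original has only just expired.
>             return (int) Math.max(1, paxosTtlSeconds - elapsedSeconds);
>         }
>     }
>
> Here writeTimestampMicros would be derived from the ballot (a time-based
> UUID), which is why the timestamp is based on server time as noted above.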
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)