[
https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344221#comment-15344221
]
Sylvain Lebresne commented on CASSANDRA-12043:
----------------------------------------------
Yes, that definitely sounds like something that will happen.
bq. One of the potential fixes could be to set the TTL based on the remaining
time left on other replicas
That's an option. Though if I'm not mistaken that means shipping that TTL with
the commit message, which can't be easily done without changing the internode
protocol (and I suspect we'd rather not wait for 4.0 to fix that).
Another option is, in {{beginAndRepairPaxos}}, to ignore the most recent commit
(and just proceed) if it's older than the TTL we set on the paxos table, which
is pretty simple.
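Roughly, that check would look like the sketch below (an illustration only, not the actual patch: the class and method names and the {{paxosTtlSeconds}} parameter are made up here, and the ballot is assumed to be a regular time-based UUID, which Paxos ballots are):
{code:java}
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class PaxosExpirationCheck
{
    // Number of 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_TO_UNIX_EPOCH_100NS = 0x01B21DD213814000L;

    // Unix timestamp (in milliseconds) encoded in a time-based (version 1) UUID ballot.
    static long unixMillisFromBallot(UUID ballot)
    {
        return (ballot.timestamp() - UUID_TO_UNIX_EPOCH_100NS) / 10000;
    }

    // True if the most recent commit is older than the TTL we put on the paxos
    // table: on some replicas it may already have turned into a tombstone, so
    // re-committing it cannot help and it can simply be ignored.
    static boolean commitOlderThanPaxosTtl(UUID mostRecentCommitBallot, int paxosTtlSeconds, long nowMillis)
    {
        long ageMillis = nowMillis - unixMillisFromBallot(mostRecentCommitBallot);
        return ageMillis > TimeUnit.SECONDS.toMillis(paxosTtlSeconds);
    }
}
{code}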
But this does bring up a related problem: even if you ignore commits being
replayed (i.e. even if you have no failures and all nodes get the commits
right away), you still have a potential window in which a new proposal might
not be able to make progress due to clock skew (and the general lack of perfect
clock synchronization). Indeed, currently every replica uses a local timestamp
to decide if something is expired when reading the paxos state, so just after a
state expires, if it is read, some nodes may return an MRC and others won't,
and no amount of committing back that MRC will help (for the same reason as
above: the node for which the MRC is expired still has a tombstone), and this
lasts until all nodes consider the MRC expired.
Granted, assuming decent NTP, that window shouldn't be too big, but it would
also be nice to remove it, and I don't think that's terribly hard. All we need
is to make sure all replicas use the same timestamp when deciding if a state is
expired during a prepare. And we don't even have to send anything new with the
prepare message, since we have the ballot of the new proposal, which contains
the current time.
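Just to illustrate that idea (the method name below is invented and this is not the patch itself), every replica can derive the same expiration reference point from the prepare's ballot:
{code:java}
import java.util.UUID;

public class StablePaxosExpirationTime
{
    // Number of 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_TO_UNIX_EPOCH_100NS = 0x01B21DD213814000L;

    // One shared "now in seconds", taken from the ballot of the incoming prepare,
    // so every replica decides whether the paxos state is expired against the
    // same instant instead of its own local clock.
    static int nowInSecondsFromPrepareBallot(UUID prepareBallot)
    {
        long unixMillis = (prepareBallot.timestamp() - UUID_TO_UNIX_EPOCH_100NS) / 10000;
        return (int) (unixMillis / 1000);
    }
}
{code}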
So anyway, I'm attaching below two commits that:
# ignore the MRC if it's older than whatever TTL we set on the paxos table
# ensure we use the same "stable" timestamp (for expiration's sake) across all
nodes when reading the paxos state and for the ignoring above.
The patch is against 2.1 and I need to merge it up (it won't merge cleanly, at
least in 3.0), but I'll wait on the initial test results before doing that, and
I also want to find some time to write some dtests for this. So I'm not marking
the patch available just yet, but the general approach can be reviewed/discussed.
|| [2.1|https://github.com/pcmanus/cassandra/commits/12043-2.1] || [utests|http://cassci.datastax.com/job/pcmanus-12043-2.1-testall/] || [dtests|http://cassci.datastax.com/job/pcmanus-12043-2.1-dtest/] ||
> Syncing most recent commit in CAS across replicas can cause all CAS queries
> in the CQL partition to fail
> --------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-12043
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12043
> Project: Cassandra
> Issue Type: Bug
> Reporter: sankalp kohli
>
> We update the most recent commit on requiredParticipant replicas if they are
> out of sync during the prepare round in the beginAndRepairPaxos method. We
> keep doing this in a loop until the requiredParticipant replicas have the same
> most recent commit or we hit the timeout.
> Say we have 3 machines A, B and C, and gc grace on the table is 10 days. We do
> a CAS write at time 0 and it goes to A and B but not to C. C will get the
> hint later but will not update the most recent commit in the paxos table. This
> is how CAS hints work.
> In the paxos table, whose gc_grace=0, most_recent_commit on A and B will be
> inserted with timestamp 0 and with a TTL of 10 days. After 10 days, this
> insert will become a tombstone at time 0 until it is compacted away, since
> gc_grace=0.
> Do a CAS read after, say, 1 day on the same CQL partition, and this time the
> prepare phase involves A and C. most_recent_commit on C for this CQL partition
> is empty. A sends the most_recent_commit to C with a timestamp of 0 and with a
> TTL of 10 days. This most_recent_commit on C will expire on the 11th day since
> it is inserted after 1 day.
> most_recent_commit is now in sync on A, B and C; however, on A and B it will
> expire on the 10th day, whereas on C it will expire on the 11th day since it
> was inserted one day later.
> Do another CAS read after 10 days, when most_recent_commit on A and B has
> expired and is treated as a tombstone until compacted. In this CAS read, say A
> and C are involved in the prepare phase. most_recent_commit will not match
> between them since it has expired on A and is still there on C. This will
> cause most_recent_commit to be applied to A with a timestamp of 0 and a TTL of
> 10 days. If A has not compacted away the original most_recent_commit which
> has expired, this new write to most_recent_commit won't be visible on reads,
> since there is a tombstone with the same timestamp (a delete wins over data
> with the same timestamp).
> Another round of prepare will follow, and again A will say it does not know
> about the most_recent_commit (the new write is covered by the original write,
> which is now a tombstone), and C will again try to send the write to A. This
> can keep going on until the request times out or only A and B are involved in
> the prepare phase.
> When A’s original most_recent_commit, which is now a tombstone, is compacted,
> all the inserts which it was covering will become live. This will in turn
> again get replayed to another replica. This ping-pong can keep going on for a
> long time.
> The issue is that most_recent_commit expires at different times across
> replicas. When it gets replayed to a replica to bring it in sync, we again
> set the full TTL from that point.
> During the CAS read which timed out, most_recent_commit was being sent to
> another replica in a loop. Even in successful requests, it will loop a couple
> of times if A and C are involved, and then succeed once the replicas which
> respond are A and B. So this will have an impact on latencies as well.
> These timeouts get worse when a machine is down, as no progress can be made
> because the machine with the unexpired commit is always involved in the CAS
> prepare round. Also, with range movements, the new machine gaining the range
> has an empty most recent commit and gets the commit at a later time, causing
> the same issue.
> Repro steps:
> 1. Paxos TTL is max(3 hours, gc_grace) as defined in
> SystemKeyspace.paxosTtl(). Change this method to not impose a minimum TTL of
> 3 hours: SystemKeyspace.paxosTtl() will look like
> {{return metadata.getGcGraceSeconds();}} instead of
> {{return Math.max(3 * 3600, metadata.getGcGraceSeconds());}}
> (a sketch of the changed method is included after this description).
> We are doing this so that we don't need to wait for 3 hours.
> Create a 3-node cluster with the code change suggested above, with machines
> A, B and C.
> CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy',
> 'replication_factor' : 3 };
> use test;
> CREATE TABLE users (user_name int PRIMARY KEY, password int);
> alter table users WITH gc_grace_seconds=120;
> consistency QUORUM;
> bring down machine C
> INSERT INTO users (user_name, password) VALUES (1, 1) IF NOT EXISTS;
> Nodetool flush on machine A and B
> Bring down machine B
> consistency SERIAL;
> tracing on;
> wait 80 seconds
> Bring up machine C
> select * from users where user_name = 1;
> Wait 40 seconds
> select * from users where user_name = 1; //All queries from this point
> forward will time out.
> One of the potential fixes could be to set the TTL based on the remaining
> time left on other replicas. This would be the TTL minus the age of the write,
> where the write's timestamp is calculated from the ballot, which uses server
> time.
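For illustration, the remaining-TTL computation suggested in that closing paragraph could look roughly like this (a sketch only; the helper name is invented, and the write time is recovered from the commit's time-based ballot):
{code:java}
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class RemainingPaxosTtl
{
    // Number of 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_TO_UNIX_EPOCH_100NS = 0x01B21DD213814000L;

    // TTL to apply when replaying most_recent_commit to an out-of-sync replica:
    // the full paxos TTL minus the time already elapsed since the commit was
    // written (its write time comes from the ballot, i.e. server time), so the
    // commit expires at roughly the same moment on every replica.
    static int remainingTtlSeconds(UUID commitBallot, int fullTtlSeconds, long nowMillis)
    {
        long writtenAtMillis = (commitBallot.timestamp() - UUID_TO_UNIX_EPOCH_100NS) / 10000;
        long elapsedSeconds = TimeUnit.MILLISECONDS.toSeconds(nowMillis - writtenAtMillis);
        return (int) Math.max(0, fullTtlSeconds - elapsedSeconds);
    }
}
{code}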
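And, for completeness, the SystemKeyspace.paxosTtl() change from step 1 of the repro above amounts to roughly the following (approximated from the description; the exact signature may differ between branches):
{code:java}
// In org.apache.cassandra.db.SystemKeyspace:
public static int paxosTtl(CFMetaData metadata)
{
    // Stock behaviour enforces a 3-hour floor on the paxos TTL:
    //   return Math.max(3 * 3600, metadata.getGcGraceSeconds());
    // For the repro, drop the floor so we only have to wait gc_grace_seconds:
    return metadata.getGcGraceSeconds();
}
{code}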