[
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616753#comment-13616753
]
Sylvain Lebresne commented on CASSANDRA-5062:
---------------------------------------------
There's a unclear part on the algorithm itself for which I'm not sure what is
the intent. During prepare, if a replica "promise", the ballot it sends back in
the response is the one it just got (the one the proposer sent). So in the
proposer (in SP.cas()), {{summary.inProgressBallot}} is necessary our own
{{ballot}} (unless it's a reject, but then we don't care anymore). Meaning,
that in SP.cas(), {{timeComparator.compare(summary.inProgressBallot,
summary.mostRecentCommitted) >= 0}} could be simplified in
{{timeComparator.compare(ballot, summary.mostRecentCommitted) >= 0}}. But it
also mean that in PrepareCallback, the inProgressUpdate we keep is pretty much
chosen randomly (since all {{inProgressBallot}} will in fact be equal to
{{ballot}}). Was that the intent? Or was the intent that during prepare, when
the replica promise, it returns the previous inProgressBalot, i.e. the one
before setting the new balot? (I think there might be a problem with both
choice but before getting to that I want to make sure what is the initial
intent).
Some other remarks while I'm at it:
* PaxosState.propose always return a true as first argument of PrepareResponse
(it always "promised").
* mostRecentCommitted doesn't seem to be ever set.
* I don't think the commit business work. Commit segments can be deleted at any
time due to flush, so I don't see how we can guarantee the persistency of the
paxos state. Furthermore, when we replay the commit log paxos entry, we don't
re-append them to the commit log, so if a node restart, play his log and
shutdown right away, it'll lost his paxos state too. Why not just use a System
table for the Paxos state? (I don't even think performance would be a big issue
because we can do queries by names that are relatively cheap and besides most
of the paxos state is deleted by commit, so the only part that will end up in
sstables is the mostRecentCommitted, but that's small and very very cacheable).
* I'm confused by FBUtilities.timeComparator. I'm not sure what is the intent
of using/comparing the clockSequence first, but I'm pretty sure this is broken.
Should that compare the timestamps of the UUIDs (The clock sequence is *not*
the timestamp). Furthermore, wasn't the goal to have a comparator to have it
only compare and timestamps (and thus not break tie on same timestamp)?
Lastly, wasn't the goal to reuse the ballot timestamp as the timestamp of the
columns in the update we finally commit (so that the column timestamps are
coherent with the order decided by Paxos)?
* Currently, the value returned by the cas method doesn't mean what it means
for CAS in general. Namely, a false might just mean that we've had one refusal
amongst the quorum first received responses, or that we've had to "replay" a
previous round first, and this irrelevant of whether our CAS applies or not. I
strongly believe we should return false only if the CAS doesn't apply, but
otherwise we should just restart a new proposal (probably after some small
random delay) until we are allowed to propose our value. Because otherwise:
** the behavior will be inintuitive since it differs from the usual behavior
** in almost all use case I can come up with, it will basically force users
to do a read every time the cas method return false, because you have to decide
whether your CAS indeed doesn't apply or something else.
** this leeks implementation details.
* It would be nice to add a comment on what problem the
{{inProgressBallot.equals(ballot)}} check in PaxosState.commit fixes.
* Can't we avoid FQRow? We can get back the keyspace from cf contained in the
row for instance (this wouldn't work if said cf was null, but we don't have
that case since it makes no sense to provide a null cf to SP.cas).
And two very small nits:
* In MessagingService, the comment on using padding should be moved before
UNUSED_1.
* In ProposeCallback, successful.addAndGet(1) -> sucessfull.incrementAndGet().
> Support CAS
> -----------
>
> Key: CASSANDRA-5062
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
> Project: Cassandra
> Issue Type: New Feature
> Components: API, Core
> Reporter: Jonathan Ellis
> Fix For: 2.0
>
> Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg,
> half-baked commit 3.jpg
>
>
> "Strong" consistency is not enough to prevent race conditions. The classic
> example is user account creation: we want to ensure usernames are unique, so
> we only want to signal account creation success if nobody else has created
> the account yet. But naive read-then-write allows clients to race and both
> think they have a green light to create.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira