[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616753#comment-13616753
 ] 

Sylvain Lebresne commented on CASSANDRA-5062:
---------------------------------------------

There's a unclear part on the algorithm itself for which I'm not sure what is 
the intent. During prepare, if a replica "promise", the ballot it sends back in 
the response is the one it just got (the one the proposer sent). So in the 
proposer (in SP.cas()), {{summary.inProgressBallot}} is necessary our own 
{{ballot}} (unless it's a reject, but then we don't care anymore).  Meaning, 
that in SP.cas(), {{timeComparator.compare(summary.inProgressBallot, 
summary.mostRecentCommitted) >= 0}} could be simplified in 
{{timeComparator.compare(ballot, summary.mostRecentCommitted) >= 0}}.  But it 
also mean that in PrepareCallback, the inProgressUpdate we keep is pretty much 
chosen randomly (since all {{inProgressBallot}} will in fact be equal to 
{{ballot}}). Was that the intent? Or was the intent that during prepare, when 
the replica promise, it returns the previous inProgressBalot, i.e. the one 
before setting the new balot? (I think there might be a problem with both 
choice but before getting to that I want to make sure what is the initial 
intent).

Some other remarks while I'm at it:
* PaxosState.propose always return a true as first argument of PrepareResponse 
(it always "promised").
* mostRecentCommitted doesn't seem to be ever set.
* I don't think the commit business work. Commit segments can be deleted at any 
time due to flush, so I don't see how we can guarantee the persistency of the 
paxos state. Furthermore, when we replay the commit log paxos entry, we don't 
re-append them to the commit log, so if a node restart, play his log and 
shutdown right away, it'll lost his paxos state too. Why not just use a System 
table for the Paxos state? (I don't even think performance would be a big issue 
because we can do queries by names that are relatively cheap and besides most 
of the paxos state is deleted by commit, so the only part that will end up in 
sstables is the mostRecentCommitted, but that's small and very very cacheable).
* I'm confused by FBUtilities.timeComparator. I'm not sure what is the intent 
of using/comparing the clockSequence first, but I'm pretty sure this is broken. 
 Should that compare the timestamps of the UUIDs (The clock sequence is *not* 
the timestamp). Furthermore, wasn't the goal to have a comparator to have it 
only compare and timestamps (and thus not break tie on same timestamp)?  
Lastly, wasn't the goal to reuse the ballot timestamp as the timestamp of the 
columns in the update we finally commit (so that the column timestamps are 
coherent with the order decided by Paxos)?
* Currently, the value returned by the cas method doesn't mean what it means 
for CAS in general. Namely, a false might just mean that we've had one refusal 
amongst the quorum first received responses, or that we've had to "replay" a 
previous round first, and this irrelevant of whether our CAS applies or not.  I 
strongly believe we should return false only if the CAS doesn't apply, but 
otherwise we should just restart a new proposal (probably after some small 
random delay) until we are allowed to propose our value. Because otherwise:
  ** the behavior will be inintuitive since it differs from the usual behavior
  ** in almost all use case I can come up with, it will basically force users 
to do a read every time the cas method return false, because you have to decide 
whether your CAS indeed doesn't apply or something else.
  ** this leeks implementation details.
* It would be nice to add a comment on what problem the 
{{inProgressBallot.equals(ballot)}} check in PaxosState.commit fixes.
* Can't we avoid FQRow? We can get back the keyspace from cf contained in the 
row for instance (this wouldn't work if said cf was null, but we don't have 
that case since it makes no sense to provide a null cf to SP.cas).

And two very small nits:
* In MessagingService, the comment on using padding should be moved before 
UNUSED_1.
* In ProposeCallback, successful.addAndGet(1) -> sucessfull.incrementAndGet().

                
> Support CAS
> -----------
>
>                 Key: CASSANDRA-5062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 2.0
>
>         Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, 
> half-baked commit 3.jpg
>
>
> "Strong" consistency is not enough to prevent race conditions.  The classic 
> example is user account creation: we want to ensure usernames are unique, so 
> we only want to signal account creation success if nobody else has created 
> the account yet.  But naive read-then-write allows clients to race and both 
> think they have a green light to create.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to