[ 
https://issues.apache.org/jira/browse/CASSANDRA-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479167#comment-17479167
 ] 

Branimir Lambov commented on CASSANDRA-17164:
---------------------------------------------

bq. I think this is actually the typical way of implementing Classic Paxos, 
even though Lamport's paper seems to suggest you must only contact the nodes 
that responded to the prepare (there may be something else specific about his 
formulation that necessitates this, I forget, as I dislike his writings on the 
topic). This is corroborated by Heidi Howard's dissertation, which was the 
easiest place I could find a straight-forward formulation of Classic Paxos 
besides that of Lamport. See Algorithm 3 on Page 30.

This is an excellent pointer, thank you. Lamport's proof also works with 
separate proposal quorum -- my only request here is that this should be stated 
somewhere as other contributors might start with Lamport's original definition 
too.

bq. I don't quite follow what you mean by this, as this is not limited to "most 
recent commit", but a ballot directly maps to the instance id of classic paxos, 
it just avoids pre-splitting the range of integers.

Well, I don't follow this one. A ballot id is something one replica selects and 
sends it as a prepare message to everyone. At this point some nodes may have 
committed a proposal, others may have accepted it, and yet others may have 
never heard of it. From the point of view of a sequence of paxos instances, the 
first two will definitely make promises for different instances, and likely the 
third one will be promising for yet another one. The fact that we tie accepting 
a promise to a majority with matching "most recent commit" means that we 
effectively treat it as the identifier of the current session.

More interestingly, we can then send a proposal to nodes with older most recent 
commit (i.e. nodes that are effectively working on a previous paxos instance), 
that proposal will be accepted (overwriting the decided value but not advancing 
the most recent commit), and this might lead to a decision using just a small 
number of participants with the right "most recent commit". Specifically, what 
stops the following from happening?

{code}
                       A mrc    A accepted    A promised             B mrc   B 
accepted   B promised             C mrc    C accepted   C promised

prepare(1) -> ABC        -           -            1                    -        
   -           1                   -          -            1
propose(1, X) -> ABC     -         1,X            1                    -        
 1,X           1                   -         1,X           1
commit(1, X) -> AB      1,X          -            1                   1,X       
   -           1                   -          -            1

prepare(2) -> AB        1,X          -            2                   1,X       
   -           2                   -         1,X           1                    
  majority AB, no refresh
propose(2, Y) -> BC     1,X          -            2                   1,X       
 2,Y           2                   -         2,Y           2                    
  returns successful Y

prepare(3) -> AC        1,X        1,X            3                   1,X       
 2,Y           2                  1,X         -            3                    
  refresh stale, then forms majority AC
propose(3, Z) -> ABC    1,X        3,Z            3                   1,X       
 3,Z           3                  1,X        3,Z           3                    
  returns successful Z   
{code}

If this is permitted, there's a further error possibility if B is partitioned 
after the 2,Y proposal succeeds and a commit is executed in B. Then the 
committed value can be read from B, and later wiped with a conflicting 
decision. 

In LWT terms, if 
X: {{column = X if not exists}}, 
Y: {{column = Y if column == X}} and 
Z: {{column = Z if column == X}},
we may be able to read both Y and Z.

bq. Could you explain what you are referring to here? I think this is all 
standard stuff for Paxos, we're again just recording the most recently used 
instance number for each register.

Judging from the above, this may be insufficient. I'm likely missing something 
here -- that something needs to be documented.

bq. The final commit phase is only required to ensure any "decree" (decision) 
is disseminated. If we have proposed that no decree be made, there is nothing 
to disseminate, and nothing to complete if another transaction encounters it. 
This is in some ways an artefact of the feature of Cassandra's implementation, 
that we initiate a paxos round without knowing if it will do anything, though 
this feature would I suppose be present for read-only operations anyway.

This is also hard to read from the code: {{Paxos.cas}} does not issue a commit 
for an empty proposal, but it seems the next {{prepare}} will return 
{{FOUND_INCOMPLETE_ACCEPTED}} and trigger reproposal and commit. This will work 
correctly, but perform one roundtrip more than simply committing when the 
agreement was reached. What am I missing?

> CEP-14: Paxos Improvements
> --------------------------
>
>                 Key: CASSANDRA-17164
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17164
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.1
>
>
> This ticket encompasses work for [CEP-14|
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to