[
https://issues.apache.org/jira/browse/CASSANDRA-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476301#comment-17476301
]
Benedict Elliott Smith commented on CASSANDRA-17164:
----------------------------------------------------
{quote}Is there any location where the algorithm is described in detail?
{quote}
The documentation that is present sort of assumes you are familiar with the
prior implementation of Paxos, and we pepper in justifications at each place
where a novel approach is taken. I will try to put together some overview
markdown documentation over the coming week. In the meantime:
{quote}the use of voting quorum that is not selected only among the replicas
that accepted a ballot
{quote}
I think this is actually the typical way of implementing Classic Paxos, even
though Lamport's paper seems to suggest you must only contact the nodes that
responded to the prepare (there may be something else specific about his
formulation that necessitates this, I forget, as I dislike his writings on the
topic). This is corroborated by [Heidi Howard's
dissertation|https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf], which
was the easiest place I could find a straightforward formulation of Classic
Paxos besides that of Lamport. See Algorithm 3 on Page 30.
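As a rough illustration of why the second phase need not be restricted to the nodes that answered the prepare: any two majorities of the same replica set share at least one node, so a majority that accepts always overlaps the majority that promised. The helper names below are hypothetical and this is only a sketch of the counting argument, not Cassandra code.

```python
from itertools import combinations
from typing import List, Set

def majorities(nodes: List[str]) -> List[Set[str]]:
    """Enumerate every smallest-possible majority of the given nodes."""
    size = len(nodes) // 2 + 1
    return [set(q) for q in combinations(nodes, size)]

def all_intersect(nodes: List[str]) -> bool:
    """True if every pair of majorities shares at least one node."""
    quorums = majorities(nodes)
    return all(a & b for a in quorums for b in quorums)

# The promise quorum and the accept quorum may be different majorities,
# but they can never be disjoint.
assert all_intersect(["n1", "n2", "n3"])
assert all_intersect(["n1", "n2", "n3", "n4", "n5"])
```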
{quote}the use of "most recent commit" as a voting session identifier
{quote}
I don't quite follow what you mean by this, as this is not limited to "most
recent commit". A ballot maps directly to the instance id of classic Paxos; it
just avoids pre-splitting the range of integers.
{quote}the sharing of ballot numbers between sessions and rejection/acceptance
based solely on ballot numbers which may belong to a different voting session
{quote}
Could you explain what you are referring to here? I think this is all standard
stuff for Paxos; we're again just recording the most recently used instance
number for each register.
{quote}advancing voting sessions without committing empty proposals
{quote}
The final commit phase is only required to ensure any "decree" (decision) is
disseminated. If we have proposed that no decree be made, there is nothing to
disseminate, and nothing to complete if another transaction encounters it. This
is in some ways an artefact of a feature of Cassandra's implementation: we
initiate a Paxos round without knowing whether it will do anything, though I
suppose this feature would be present for read-only operations anyway.
{quote}replicas skipping voting sessions because of stale participant refresh
{quote}
There are two possible things you might mean by this. I think you are referring
to the situation where we send a commit and then continue with the ballot we
have already prepared? In which case I'm not sure this is really in conflict
with any formulation I've seen, which tends to gloss over handling of
{_}commit{_}. I think it may arise solely from the particulars of Cassandra: we
are not updating a register but agreeing a delta, and we only disseminate this
to some majority (which may differ from the one that received any prior delta),
so we must ensure that each _Commit_ is witnessed by a majority so that the
complete register state may be constructed from any majority. In normal
formulations the register is overwritten, so I don't think the _Commit_ even
needs to be received if it is superseded by another {_}Commit{_}, and I think
many formulations ignore it entirely as a result.
Anyway, justifying it seems pretty straightforward: if any other command were to
supersede us we would fail the _Accept_ phase, and if not then by updating the
_MostRecentCommit_ register we know precisely what the register state is on the
node, and it is equivalent to having received this response in the first place,
so we may proceed safely.
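To make the point about deltas concrete, here is a minimal sketch, assuming each replica stores the committed deltas it has witnessed keyed by ballot. The `reconstruct` function and the sample data are hypothetical and only illustrate the idea: if every commit reaches some majority, merging the deltas held by any majority recovers the complete sequence.

```python
from itertools import combinations
from typing import Dict, List

def reconstruct(replicas: List[Dict[int, str]]) -> List[str]:
    """Merge the deltas held by a set of replicas, ordered by ballot."""
    merged: Dict[int, str] = {}
    for deltas in replicas:
        merged.update(deltas)
    return [merged[ballot] for ballot in sorted(merged)]

# Three replicas; each delta was committed to a different majority.
a = {1: "x=1", 3: "z=3"}
b = {1: "x=1", 2: "y=2"}
c = {2: "y=2", 3: "z=3"}

# Any two of the three replicas together hold every committed delta.
for majority in combinations([a, b, c], 2):
    assert reconstruct(list(majority)) == ["x=1", "y=2", "z=3"]
```

A single replica on its own may be missing deltas, which is why each commit has to be witnessed by a majority rather than by whichever nodes happen to respond.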
{quote}read vs write promises
{quote}
This is just a very simple formulation of operation commutativity. We linearise
writes with writes and reads, but we do not linearise reads with each other
since they are commutative. So any read operation only consults the write
registers, but updates the read registers, whereas writes update the write
registers and consult both.
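A minimal sketch of that rule, assuming one pair of promise registers per key on each replica and integer ballots. The class and method names are hypothetical and the real implementation certainly differs in detail.

```python
from dataclasses import dataclass

@dataclass
class PromiseRegisters:
    """Hypothetical per-key promise state held by one replica."""
    promised_write: int = 0  # highest ballot promised to a write
    promised_read: int = 0   # highest ballot promised to a read

    def prepare_read(self, ballot: int) -> bool:
        # Reads consult only the write register, so reads never reject
        # each other; they record themselves in the read register.
        if ballot <= self.promised_write:
            return False
        self.promised_read = max(self.promised_read, ballot)
        return True

    def prepare_write(self, ballot: int) -> bool:
        # Writes consult both registers and update the write register.
        if ballot <= max(self.promised_write, self.promised_read):
            return False
        self.promised_write = ballot
        return True

r = PromiseRegisters()
assert r.prepare_read(5)       # first read is promised
assert r.prepare_read(3)       # an older read still succeeds: reads commute
assert not r.prepare_write(4)  # a write must be ordered after the read at 5
assert r.prepare_write(6)
assert not r.prepare_read(5)   # reads are still ordered against writes
```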
{quote}the handling of range movements
{quote}
Fair, this is quite complex, and we should have already put in an overview
here. In simple terms, each node tracks those operations that have been
witnessed but are not known to have committed. Each node is able to coordinate
the completion of these operations, either by invalidating them, committing
them, or witnessing something newer. By performing this on a majority of nodes
we are able to ensure that all operations that may have reached a decision
prior to this mechanism being invoked are now committed to a majority of nodes
in their base table. By performing this after a node becomes pending but before
streaming begins we ensure that a new node was either already participating in
any operation and will be informed of it, or that it will receive its data via
bootstrap.
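The per-node bookkeeping could be pictured roughly as follows. This is a sketch under assumptions: the `UncommittedTracker` name and its methods are made up for illustration, ballots are plain integers, and a commit at or above the tracked ballot is treated as resolving the entry.

```python
from typing import Dict

class UncommittedTracker:
    """Hypothetical record of operations witnessed but not known committed."""

    def __init__(self) -> None:
        # key -> highest ballot witnessed for that key and not yet resolved
        self._latest: Dict[str, int] = {}

    def witnessed(self, key: str, ballot: int) -> None:
        if ballot > self._latest.get(key, -1):
            self._latest[key] = ballot

    def committed(self, key: str, ballot: int) -> None:
        # A commit at or above the tracked ballot resolves the entry.
        if key in self._latest and ballot >= self._latest[key]:
            del self._latest[key]

    def pending(self) -> Dict[str, int]:
        """Operations a repair would still need to complete."""
        return dict(self._latest)

tracker = UncommittedTracker()
tracker.witnessed("k1", 3)
tracker.witnessed("k2", 5)
tracker.committed("k1", 3)
assert tracker.pending() == {"k2": 5}
```

A repair in this picture would walk `pending()` and complete each entry by invalidating it, committing it, or witnessing something newer, and would count as done once a majority of nodes report nothing pending below the point at which it started.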
{quote}state expiration
{quote}
Using the same mechanism as described above, each range has a global lower
bound on ballots that are not known to have committed on a majority of nodes,
and will discount any incomplete operations with a lower ballot. Therefore the
data associated with these ballots can all be expunged. This requires regular
paxos repairs to be run, which can either occur as part of incremental /
regular repair, or be scheduled separately. In practice this means much faster
expiration, and that users who enable this can use ANY commit consistency
level. We also need to provide some NEWS information explaining all of this.
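The expiry rule can be sketched as a filter over per-key state, assuming the lower bound for the range has already been established by a completed paxos repair. The function name and the shape of the state map are hypothetical.

```python
from typing import Dict

def expunge(in_progress: Dict[str, int], low_bound: int) -> Dict[str, int]:
    """Drop state for ballots below the range's lower bound.

    in_progress maps a key to the ballot of its latest incomplete
    operation; anything below low_bound is discounted and can be removed.
    """
    return {key: ballot for key, ballot in in_progress.items()
            if ballot >= low_bound}

state = {"k1": 10, "k2": 42, "k3": 57}
assert expunge(state, low_bound=42) == {"k2": 42, "k3": 57}
```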
Does that at least get you moving forward, while I work on a more comprehensive
overview?
> CEP-14: Paxos Improvements
> --------------------------
>
> Key: CASSANDRA-17164
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17164
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination, Consistency/Repair
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.1
>
>
> This ticket encompasses work for [CEP-14|
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements].