[
https://issues.apache.org/jira/browse/CASSANDRA-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202742#comment-17202742
]
Jeff Jirsa commented on CASSANDRA-16139:
----------------------------------------
If we're rewriting all of the membership/ownership code, we should definitely
be thinking about a world that isn't based on tokens and hash tables.
> Safe Ring Membership Protocol
> -----------------------------
>
> Key: CASSANDRA-16139
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16139
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip, Cluster/Membership
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Normal
>
> This ticket presents a practical protocol for performing safe ring membership
> updates in Cassandra. This protocol will enable reliable concurrent ring
> membership updates.
> The proposed protocol is composed of the following macro-steps:
> *PROPOSE:* An initiator node wanting to make updates to the current ring
> structure (such as joining, leaving the ring or changing token assignments)
> must propose the change to the other members of the ring (cohort).
> *ACCEPT:* Upon receiving a proposal the other ring members determine if the
> change is compatible with their local version of the ring, and if so, they
> promise to accept the change proposed by the initiator. The ring members do
> not accept proposals if they had already promised to honor another proposal,
> to avoid conflicting ring membership updates.
> *COMMIT:* Once the initiator receives acceptances from all the nodes in the
> cohort, it commits the proposal by broadcasting the proposed ring delta via
> gossip. Upon receiving these changes, the other members of the cohort apply
> the delta to their local version of the ring and broadcast their new computed
> version via gossip. The initiator concludes the ring membership update
> operation by checking that all nodes agree on the new proposed version.
> *ABORT:* A proposal not accepted by all members of the cohort may be
> automatically aborted by the initiator or manually via a command line tool.
> For simplicity the protocol above requires that all nodes are up during the
> proposal step, but it should be possible to optimize it to require only a
> quorum of nodes up to perform ring changes.
> A python pseudo-code of the protocol is available
> [here|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-safe_ring_membership-py].
> With the abstraction above it becomes very simple to perform ring change
> operations:
> *
> [bootstrap|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-bootstrap-py]
> *
> [replace|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-replace-py]
> *
> [move|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-move-py]
> * [remove
> node|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_node-py]
> * [remove
> token|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_token-py]
> h4. Token Ring Data Structure
> The token ring data structure can be seen as a [Delta State Replicated Data
> Type|https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs]
> (Delta CRDT) containing the state of all (virtual) nodes in the cluster
> where updates to the ring are operations on this CRDT.
> Each member publishes its latest local accepted state (delta state) via
> gossip and the union of all delta states comprise the global ring state. The
> delta state must be commutative and idempotent to ensure all nodes will
> eventually reach the same global state no matter the order received.
> The content-addressed fingerprint of the global ring state uniquely
> identifies the ring version and provides a simple way to verify agreement
> between nodes in the cluster. Any change to the ring membership must be
> agreed using the described protocol, ensuring that both conditions are met:
> * All nodes have the same current view of the cluster before the update
> (verified via the ring version fingerprint).
> * All nodes have agreed to make the exact same update and not accept any
> other update before the current proposed update is committed or aborted.
> h4. Ring Convergence Time
> Assuming there are no network partitions, the ring membership convergence
> time will be dominated by the commit step since that is performed via gossip
> broadcast.
> The gossip broadcast is performed by sending the ring delta to the seed
> nodes, since other nodes will contact seed nodes with a #seeds / #nodes
> probability. This will define an upper bound for the maximum time it takes to
> propagate a ring update that was accepted by all members of the ring.
> On a cluster with 10% of the nodes as seeds, it’s guaranteed that a ring
> membership update operation will not take much longer than 10 seconds with
> the current gossip interval of 1 second. A simple way to reduce this upper
> bound is to make cohort acceptors gossip more frequently with seeds when
> there are pending ring membership updates.
> h4. Failure Modes
> - Concurrent Proposals:
> -- Concurrent initiators will not gather sufficient promises from all cohort
> members, and thus, will not succeed. Unsuccessful proposals may be cleaned up
> manually or automatically via the ABORT step.
> - Single proposal: Initiator Failure
> -- Initiator is partitioned before sending proposal
> --- Initiator will not gather sufficient promises from all cohort members,
> and thus, the ring membership update will not succeed.
> -- Initiator is partitioned after proposal is accepted by a subset of cohort
> members:
> --- Initiator will not gather sufficient promises from all cohort members.
> No other proposal will succeed before the partition is healed or it’s
> manually ABORTED.
> -- Initiator is partitioned after proposal is accepted by all cohort members
> --- No other proposal will succeed before the partition is healed or it’s
> manually ABORTED.
> Initiator is partitioned after proposal is committed
> --- If a single member of the cohort committed the proposal, all other
> members will eventually commit it since they will receive the update via
> gossip. No other proposal will be accepted before all nodes commit the
> current proposal.
> - Single proposal: Cohort member Failure
> -- Cohort member is partitioned before receiving proposal
> --- Initiator will not gather sufficient promises from all cohort members.
> No other proposal will succeed before the partition is healed because they
> will not be able to reach this cohort member.
> -- Cohort member is partitioned after accepting proposal
> --- If all other members of the cohort accepted the proposal, the initiator
> will COMMIT the proposal. No other proposal will succeed before the partition
> is healed because they will not be able to reach this cohort member.
> -- Cohort member is partitioned after committing proposal
> --- If a single member of the cohort committed the proposal, all other
> members will eventually commit it since they will receive the update via
> gossip from the initiator. No other proposal will be accepted before all
> nodes commit the current proposal.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]