[ 
https://issues.apache.org/jira/browse/CASSANDRA-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202742#comment-17202742
 ] 

Jeff Jirsa commented on CASSANDRA-16139:
----------------------------------------

If we're rewriting all of the membership/ownership code, we should definitely 
be thinking about a world that isn't based on tokens and hash tables.



> Safe Ring Membership Protocol
> -----------------------------
>
>                 Key: CASSANDRA-16139
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16139
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>
> This ticket presents a practical protocol for performing safe ring membership 
> updates in Cassandra. This protocol will enable reliable concurrent ring 
> membership updates.
> The proposed protocol is composed of the following macro-steps:
> *PROPOSE:* An initiator node wanting to make updates to the current ring 
> structure (such as joining, leaving the ring or changing token assignments) 
> must propose the change to the other members of the ring (cohort).
> *ACCEPT:* Upon receiving a proposal the other ring members determine if the 
> change is compatible with their local version of the ring, and if so, they 
> promise to accept the change proposed by the initiator. The ring members do 
> not accept proposals if they had already promised to honor another proposal, 
> to avoid conflicting ring membership updates.
> *COMMIT:* Once the initiator receives acceptances from all the nodes in the 
> cohort, it commits the proposal by broadcasting the proposed ring delta via 
> gossip. Upon receiving these changes, the other members of the cohort apply 
> the delta to their local version of the ring and broadcast their new computed 
> version via gossip. The initiator concludes the ring membership update 
> operation by checking that all nodes agree on the new proposed version.
> *ABORT:* A proposal not accepted by all members of the cohort may be 
> automatically aborted by the initiator or manually via a command line tool.
> For simplicity the protocol above requires that all nodes are up during the 
> proposal step, but it should be possible to optimize it to require only a 
> quorum of nodes up to perform ring changes.
> A python pseudo-code of the protocol is available 
> [here|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-safe_ring_membership-py].
> With the abstraction above it becomes very simple to perform ring change 
> operations:
>  * 
> [bootstrap|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-bootstrap-py]
>  * 
> [replace|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-replace-py]
>  * 
> [move|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-move-py]
>  * [remove 
> node|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_node-py]
>  * [remove 
> token|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_token-py]
> h4. Token Ring Data Structure
> The token ring data structure can be seen as a [Delta State Replicated Data 
> Type|https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs]
>  (Delta CRDT) containing the state of all (virtual) nodes in the cluster 
> where updates to the ring are operations on this CRDT.
> Each member publishes its latest local accepted state (delta state) via 
> gossip and the union of all delta states comprise the global ring state. The 
> delta state must be commutative and idempotent to ensure all nodes will 
> eventually reach the same global state no matter the order received.
> The content-addressed fingerprint of the global ring state uniquely 
> identifies the ring version and provides a simple way to verify agreement 
> between nodes in the cluster. Any change to the ring membership must be 
> agreed using the described protocol, ensuring that both conditions are met:
>  * All nodes have the same current view of the cluster before the update 
> (verified via the ring version fingerprint).
>  * All nodes have agreed to make the exact same update and not accept any 
> other update before the current proposed update is committed or aborted.
> h4. Ring Convergence Time
> Assuming there are no network partitions, the ring membership convergence 
> time will be dominated by the commit step since that is performed via gossip 
> broadcast.
> The gossip broadcast is performed by sending the ring delta to the seed 
> nodes, since other nodes will contact seed nodes with a #seeds / #nodes 
> probability. This will define an upper bound for the maximum time it takes to 
> propagate a ring update that was accepted by all members of the ring.
> On a cluster with 10% of the nodes as seeds, it’s guaranteed that a ring 
> membership update operation will not take much longer than 10 seconds with 
> the current gossip interval of 1 second. A simple way to reduce this upper 
> bound is to make cohort acceptors gossip more frequently with seeds when 
> there are pending ring membership updates.
> h4. Failure Modes
>  - Concurrent Proposals:
>  -- Concurrent initiators will not gather sufficient promises from all cohort 
> members, and thus, will not succeed. Unsuccessful proposals may be cleaned up 
> manually or automatically via the ABORT step.
>  - Single proposal: Initiator Failure
>  -- Initiator is partitioned before sending proposal
>  --- Initiator will not gather sufficient promises from all cohort members, 
> and thus, the ring membership update will not succeed.
>  -- Initiator is partitioned after proposal is accepted by a subset of cohort 
> members:
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  -- Initiator is partitioned after proposal is accepted by all cohort members
>  --- No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  Initiator is partitioned after proposal is committed
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip. No other proposal will be accepted before all nodes commit the 
> current proposal.
>  - Single proposal: Cohort member Failure
>  -- Cohort member is partitioned before receiving proposal
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed because they 
> will not be able to reach this cohort member.
>  -- Cohort member is partitioned after accepting proposal
>  --- If all other members of the cohort accepted the proposal, the initiator 
> will COMMIT the proposal. No other proposal will succeed before the partition 
> is healed because they will not be able to reach this cohort member.
>  -- Cohort member is partitioned after committing proposal
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip from the initiator. No other proposal will be accepted before all 
> nodes commit the current proposal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to