[ 
https://issues.apache.org/jira/browse/CASSANDRA-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203474#comment-17203474
 ] 

Paulo Motta commented on CASSANDRA-16139:
-----------------------------------------

Thanks for your comments Benedict and Jeff! Please find follow-up below.
{quote}Any replacement should not be built upon Gossip (either in its current 
or an improved form)
{quote}
The proposed protocol uses gossip on 2 steps:
 a) before PROPOSE, to validate the initiator has the same ring version as the 
cohort;
 b) on COMMIT, to broadcast the ring membership update.

Step a) is an optimization that prevents the initiator from proposing a new 
ring version if there's a current disagreement. Step b) adds resilience against 
initiator failure during commit at the expense of latency, but can easily be 
made synchronous to address that.

I may be failing to see what's problematic about gossip here so I'll wait for 
your justification on why we should avoid it.
{quote}Being able to operate with a quorum is probably a lot harder than with 
every node's involvement, so I'd suggest thinking about that sooner than later
{quote}
That's a valid point. I will focus on making this work with all nodes for now, 
since that's a fair assumption/requirement, and if we see necessity we can get 
back to this later.
{quote}How do you guarantee that all participants in an operation have a 
consistent view of the ring for the purposes of that operation?
{quote}
content-based versioning. example:
 * Node A ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C": 
["3:N", "7:N"]} }{code}

 * Node B ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C": 
["3:N", "7:N"]} }{code}

 * Node C ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C": 
["3:N", "7:N"]} }{code}

Suppose now nodes "D" and "E" want to join the ring with the same tokens "4" 
and "8" - only one of them should succeed.

Each of them will read the current ring version *b710*. Each node will generate 
the following "proposed" ring version:
 * Node D proposed ring (version: *6f69*):
{code:json}
{ "previous": "b710", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C": 
["3:N", "7:N"], "D": ["4:J", "8:J"]} } {code}

 * Node E proposed ring (version: *8f88*):
{code:json}
{ "previous": "b710", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C": 
["3:N", "7:N"], "E": ["4:J", "8:J"]} } {code}

They will then each send a PROPOSE message with the following parameters to the 
cohort:
 * NODE D:
{code:java}
PROPOSE(current_version="b710", proposed_version="6f69"){code}

 * NODE E:
{code:java}
PROPOSE(current_version="b710", proposed_version="8f88"){code}

In this situation it's possible that each of the 3 situations happen:
 * Neither E nor D gets a PROMISE from all nodes - no proposal succeeds
 * NODE D is able to get a promise from nodes A, B and C for version *6f69*.
 * NODE E is able to get a promise from nodes A, B and C for version *8f88*.

Now let's say NODE D proposal succeeds and the ring updates its version to 
*6f69*. Any proposal from NODE E referencing the previous ring version *b710* 
will be rejected by the cohort, so node E will be forced to update its version 
before submitting a new proposal.
{quote}If we're rewriting all of the membership/ownership code, we should 
definitely be thinking about a world that isn't based on tokens and hash tables.
{quote}
Would you care to elaborate why? My high level goal here is to ensure we can 
reliably add/remove/replace nodes to a cluster, and this seems to be reasonably 
doable with consistent hashing as far as I understand. I'd love to explore 
alternatives but I'd be interested in learning what requirements are not 
fulfilled by the current architecture.

> Safe Ring Membership Protocol
> -----------------------------
>
>                 Key: CASSANDRA-16139
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16139
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>
> This ticket presents a practical protocol for performing safe ring membership 
> updates in Cassandra. This protocol will enable reliable concurrent ring 
> membership updates.
> The proposed protocol is composed of the following macro-steps:
> *PROPOSE:* An initiator node wanting to make updates to the current ring 
> structure (such as joining, leaving the ring or changing token assignments) 
> must propose the change to the other members of the ring (cohort).
> *ACCEPT:* Upon receiving a proposal the other ring members determine if the 
> change is compatible with their local version of the ring, and if so, they 
> promise to accept the change proposed by the initiator. The ring members do 
> not accept proposals if they had already promised to honor another proposal, 
> to avoid conflicting ring membership updates.
> *COMMIT:* Once the initiator receives acceptances from all the nodes in the 
> cohort, it commits the proposal by broadcasting the proposed ring delta via 
> gossip. Upon receiving these changes, the other members of the cohort apply 
> the delta to their local version of the ring and broadcast their new computed 
> version via gossip. The initiator concludes the ring membership update 
> operation by checking that all nodes agree on the new proposed version.
> *ABORT:* A proposal not accepted by all members of the cohort may be 
> automatically aborted by the initiator or manually via a command line tool.
> For simplicity the protocol above requires that all nodes are up during the 
> proposal step, but it should be possible to optimize it to require only a 
> quorum of nodes up to perform ring changes.
> A python pseudo-code of the protocol is available 
> [here|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-safe_ring_membership-py].
> With the abstraction above it becomes very simple to perform ring change 
> operations:
>  * 
> [bootstrap|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-bootstrap-py]
>  * 
> [replace|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-replace-py]
>  * 
> [move|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-move-py]
>  * [remove 
> node|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_node-py]
>  * [remove 
> token|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_token-py]
> h4. Token Ring Data Structure
> The token ring data structure can be seen as a [Delta State Replicated Data 
> Type|https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs]
>  (Delta CRDT) containing the state of all (virtual) nodes in the cluster 
> where updates to the ring are operations on this CRDT.
> Each member publishes its latest local accepted state (delta state) via 
> gossip and the union of all delta states comprise the global ring state. The 
> delta state must be commutative and idempotent to ensure all nodes will 
> eventually reach the same global state no matter the order received.
> The content-addressed fingerprint of the global ring state uniquely 
> identifies the ring version and provides a simple way to verify agreement 
> between nodes in the cluster. Any change to the ring membership must be 
> agreed using the described protocol, ensuring that both conditions are met:
>  * All nodes have the same current view of the cluster before the update 
> (verified via the ring version fingerprint).
>  * All nodes have agreed to make the exact same update and not accept any 
> other update before the current proposed update is committed or aborted.
> h4. Ring Convergence Time
> Assuming there are no network partitions, the ring membership convergence 
> time will be dominated by the commit step since that is performed via gossip 
> broadcast.
> The gossip broadcast is performed by sending the ring delta to the seed 
> nodes, since other nodes will contact seed nodes with a #seeds / #nodes 
> probability. This will define an upper bound for the maximum time it takes to 
> propagate a ring update that was accepted by all members of the ring.
> On a cluster with 10% of the nodes as seeds, it’s guaranteed that a ring 
> membership update operation will not take much longer than 10 seconds with 
> the current gossip interval of 1 second. A simple way to reduce this upper 
> bound is to make cohort acceptors gossip more frequently with seeds when 
> there are pending ring membership updates.
> h4. Failure Modes
>  - Concurrent Proposals:
>  -- Concurrent initiators will not gather sufficient promises from all cohort 
> members, and thus, will not succeed. Unsuccessful proposals may be cleaned up 
> manually or automatically via the ABORT step.
>  - Single proposal: Initiator Failure
>  -- Initiator is partitioned before sending proposal
>  --- Initiator will not gather sufficient promises from all cohort members, 
> and thus, the ring membership update will not succeed.
>  -- Initiator is partitioned after proposal is accepted by a subset of cohort 
> members:
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  -- Initiator is partitioned after proposal is accepted by all cohort members
>  --- No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  Initiator is partitioned after proposal is committed
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip. No other proposal will be accepted before all nodes commit the 
> current proposal.
>  - Single proposal: Cohort member Failure
>  -- Cohort member is partitioned before receiving proposal
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed because they 
> will not be able to reach this cohort member.
>  -- Cohort member is partitioned after accepting proposal
>  --- If all other members of the cohort accepted the proposal, the initiator 
> will COMMIT the proposal. No other proposal will succeed before the partition 
> is healed because they will not be able to reach this cohort member.
>  -- Cohort member is partitioned after committing proposal
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip from the initiator. No other proposal will be accepted before all 
> nodes commit the current proposal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to