[
https://issues.apache.org/jira/browse/CASSANDRA-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203474#comment-17203474
]
Paulo Motta commented on CASSANDRA-16139:
-----------------------------------------
Thanks for your comments Benedict and Jeff! Please find follow-up below.
{quote}Any replacement should not be built upon Gossip (either in its current
or an improved form)
{quote}
The proposed protocol uses gossip on 2 steps:
a) before PROPOSE, to validate the initiator has the same ring version as the
cohort;
b) on COMMIT, to broadcast the ring membership update.
Step a) is an optimization that prevents the initiator from proposing a new
ring version if there's a current disagreement. Step b) adds resilience against
initiator failure during commit at the expense of latency, but can easily be
made synchronous to address that.
I may be failing to see what's problematic about gossip here so I'll wait for
your justification on why we should avoid it.
{quote}Being able to operate with a quorum is probably a lot harder than with
every node's involvement, so I'd suggest thinking about that sooner than later
{quote}
That's a valid point. I will focus on making this work with all nodes for now,
since that's a fair assumption/requirement, and if we see necessity we can get
back to this later.
{quote}How do you guarantee that all participants in an operation have a
consistent view of the ring for the purposes of that operation?
{quote}
content-based versioning. example:
* Node A ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C":
["3:N", "7:N"]} }{code}
* Node B ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C":
["3:N", "7:N"]} }{code}
* Node C ring (version: *b710*):
{code:json}
{ "previous": "5f36", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C":
["3:N", "7:N"]} }{code}
Suppose now nodes "D" and "E" want to join the ring with the same tokens "4"
and "8" - only one of them should succeed.
Each of them will read the current ring version *b710*. Each node will generate
the following "proposed" ring version:
* Node D proposed ring (version: *6f69*):
{code:json}
{ "previous": "b710", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C":
["3:N", "7:N"], "D": ["4:J", "8:J"]} } {code}
* Node E proposed ring (version: *8f88*):
{code:json}
{ "previous": "b710", "vnodes": {"A": ["1:N", "5:N"], "B": ["2:N", "6:N", "C":
["3:N", "7:N"], "E": ["4:J", "8:J"]} } {code}
They will then each send a PROPOSE message with the following parameters to the
cohort:
* NODE D:
{code:java}
PROPOSE(current_version="b710", proposed_version="6f69"){code}
* NODE E:
{code:java}
PROPOSE(current_version="b710", proposed_version="8f88"){code}
In this situation it's possible that each of the 3 situations happen:
* Neither E nor D gets a PROMISE from all nodes - no proposal succeeds
* NODE D is able to get a promise from nodes A, B and C for version *6f69*.
* NODE E is able to get a promise from nodes A, B and C for version *8f88*.
Now let's say NODE D proposal succeeds and the ring updates its version to
*6f69*. Any proposal from NODE E referencing the previous ring version *b710*
will be rejected by the cohort, so node E will be forced to update its version
before submitting a new proposal.
{quote}If we're rewriting all of the membership/ownership code, we should
definitely be thinking about a world that isn't based on tokens and hash tables.
{quote}
Would you care to elaborate why? My high level goal here is to ensure we can
reliably add/remove/replace nodes to a cluster, and this seems to be reasonably
doable with consistent hashing as far as I understand. I'd love to explore
alternatives but I'd be interested in learning what requirements are not
fulfilled by the current architecture.
> Safe Ring Membership Protocol
> -----------------------------
>
> Key: CASSANDRA-16139
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16139
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip, Cluster/Membership
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Normal
>
> This ticket presents a practical protocol for performing safe ring membership
> updates in Cassandra. This protocol will enable reliable concurrent ring
> membership updates.
> The proposed protocol is composed of the following macro-steps:
> *PROPOSE:* An initiator node wanting to make updates to the current ring
> structure (such as joining, leaving the ring or changing token assignments)
> must propose the change to the other members of the ring (cohort).
> *ACCEPT:* Upon receiving a proposal the other ring members determine if the
> change is compatible with their local version of the ring, and if so, they
> promise to accept the change proposed by the initiator. The ring members do
> not accept proposals if they had already promised to honor another proposal,
> to avoid conflicting ring membership updates.
> *COMMIT:* Once the initiator receives acceptances from all the nodes in the
> cohort, it commits the proposal by broadcasting the proposed ring delta via
> gossip. Upon receiving these changes, the other members of the cohort apply
> the delta to their local version of the ring and broadcast their new computed
> version via gossip. The initiator concludes the ring membership update
> operation by checking that all nodes agree on the new proposed version.
> *ABORT:* A proposal not accepted by all members of the cohort may be
> automatically aborted by the initiator or manually via a command line tool.
> For simplicity the protocol above requires that all nodes are up during the
> proposal step, but it should be possible to optimize it to require only a
> quorum of nodes up to perform ring changes.
> A python pseudo-code of the protocol is available
> [here|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-safe_ring_membership-py].
> With the abstraction above it becomes very simple to perform ring change
> operations:
> *
> [bootstrap|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-bootstrap-py]
> *
> [replace|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-replace-py]
> *
> [move|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-move-py]
> * [remove
> node|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_node-py]
> * [remove
> token|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_token-py]
> h4. Token Ring Data Structure
> The token ring data structure can be seen as a [Delta State Replicated Data
> Type|https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs]
> (Delta CRDT) containing the state of all (virtual) nodes in the cluster
> where updates to the ring are operations on this CRDT.
> Each member publishes its latest local accepted state (delta state) via
> gossip and the union of all delta states comprise the global ring state. The
> delta state must be commutative and idempotent to ensure all nodes will
> eventually reach the same global state no matter the order received.
> The content-addressed fingerprint of the global ring state uniquely
> identifies the ring version and provides a simple way to verify agreement
> between nodes in the cluster. Any change to the ring membership must be
> agreed using the described protocol, ensuring that both conditions are met:
> * All nodes have the same current view of the cluster before the update
> (verified via the ring version fingerprint).
> * All nodes have agreed to make the exact same update and not accept any
> other update before the current proposed update is committed or aborted.
> h4. Ring Convergence Time
> Assuming there are no network partitions, the ring membership convergence
> time will be dominated by the commit step since that is performed via gossip
> broadcast.
> The gossip broadcast is performed by sending the ring delta to the seed
> nodes, since other nodes will contact seed nodes with a #seeds / #nodes
> probability. This will define an upper bound for the maximum time it takes to
> propagate a ring update that was accepted by all members of the ring.
> On a cluster with 10% of the nodes as seeds, it’s guaranteed that a ring
> membership update operation will not take much longer than 10 seconds with
> the current gossip interval of 1 second. A simple way to reduce this upper
> bound is to make cohort acceptors gossip more frequently with seeds when
> there are pending ring membership updates.
> h4. Failure Modes
> - Concurrent Proposals:
> -- Concurrent initiators will not gather sufficient promises from all cohort
> members, and thus, will not succeed. Unsuccessful proposals may be cleaned up
> manually or automatically via the ABORT step.
> - Single proposal: Initiator Failure
> -- Initiator is partitioned before sending proposal
> --- Initiator will not gather sufficient promises from all cohort members,
> and thus, the ring membership update will not succeed.
> -- Initiator is partitioned after proposal is accepted by a subset of cohort
> members:
> --- Initiator will not gather sufficient promises from all cohort members.
> No other proposal will succeed before the partition is healed or it’s
> manually ABORTED.
> -- Initiator is partitioned after proposal is accepted by all cohort members
> --- No other proposal will succeed before the partition is healed or it’s
> manually ABORTED.
> Initiator is partitioned after proposal is committed
> --- If a single member of the cohort committed the proposal, all other
> members will eventually commit it since they will receive the update via
> gossip. No other proposal will be accepted before all nodes commit the
> current proposal.
> - Single proposal: Cohort member Failure
> -- Cohort member is partitioned before receiving proposal
> --- Initiator will not gather sufficient promises from all cohort members.
> No other proposal will succeed before the partition is healed because they
> will not be able to reach this cohort member.
> -- Cohort member is partitioned after accepting proposal
> --- If all other members of the cohort accepted the proposal, the initiator
> will COMMIT the proposal. No other proposal will succeed before the partition
> is healed because they will not be able to reach this cohort member.
> -- Cohort member is partitioned after committing proposal
> --- If a single member of the cohort committed the proposal, all other
> members will eventually commit it since they will receive the update via
> gossip from the initiator. No other proposal will be accepted before all
> nodes commit the current proposal.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]