[
https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728069#comment-14728069
]
Jason Brown commented on CASSANDRA-9667:
----------------------------------------
updated manual join protocol:
We require two "consensus transactions"(1) to become a full token-owner in the
cluster: one transaction to become a member, and another to take ownership of
some tokens(2). So, the protocol would look like this:
# new node comes up, and performs consensus transaction to become a member.
Will probably need to contact a seed to perform transaction, and should happen
automatically, without operator intervention.
## transaction adds new node to the member set, as well as incrementing a
logical clock(2)
## commit the transaction (retry on failure)
## updated member set/logical clock is gossiped - perhaps also directly
message the joining node, as well (sending it the total member set and
metadata).
## node is allowed to participate in gossip after becoming a member.
# Repeat for each new node to be added to the cluster.
# operator can inoke 'nodetool show-pending-nodes' (or whatever we call the
command) to see a plan of the current nodes waiting to be assigned tokens (and
become part of the ring). Operator can confirm that everything looks as
expected.
# Operator invokes 'nodetool joinall' to start the next consensus transaction,
to make new nodes owners of tokens:
## proposer of transaction auto-generates tokens for nodes that did not declare
any (like a node replace),
## updates the member set with tokens for each node, and increments the logical
clock
## commit the transaction (retry on failure)
## updated member set/increments is gossiped - perhaps also directly message
the transitioning nodes, as well.
## transitioning nodes can change their status themselves, and start
bootstrapping (if necessary)
Note: if we generate token in the ownership step (as I think we should, for
optiminzing the token selection), then we cannot show the 'pending ranges/size
to transfer' in the 'nodetool show-pending-nodes' command output (as we don't
know all the nodes/tokens yet) as [~tjake] proposed. However, we might be able
to get away with displaying it after the owner consensus transaction is
complete. Alternatively, we could run the same alg that generates the tokens to
'predict' what the tokens will look like, and display that as a potentially
non-exact (but close enough) approximation of what the cluster will look like
before executing the transaction.
1) I'm not completely sure of using LWT for the necessary linearizability
operation (for safe cluster state transitions), so I'll use the stand-in
"consensus transaction" for now.
2) Tokens may be assigned manually by an operator, derived from another node
that is being replaced, or auto-generated by the a proposer of the consensus
transaction (think proposer in Paxos).
3) Logical clock - an monotonic counter that indicates the linearized changes
to the member set (either adding/removing nodes, or changes to tokens of a
member). Basically, an integer that is incremented on every modification.
> strongly consistent membership and ownership
> --------------------------------------------
>
> Key: CASSANDRA-9667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jason Brown
> Assignee: Jason Brown
> Labels: LWT, membership, ownership
> Fix For: 3.x
>
>
> Currently, there is advice to users to "wait two minutes between adding new
> nodes" in order for new node tokens, et al, to propagate. Further, as there's
> no coordination amongst joining node wrt token selection, new nodes can end
> up selecting ranges that overlap with other joining nodes. This causes a lot
> of duplicate streaming from the existing source nodes as they shovel out the
> bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent
> membership and ownership changes in cassandra such that changes are performed
> in a linearizable and safe manner. The basic idea is to use LWT operations
> over a global system table, and leverage the linearizability of LWT for
> ensuring the safety of cluster membership/ownership state changes. This work
> is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and
> range move (there may be others I'm not thinking of) will need to be modified
> to participate in this scheme, as well as changes to nodetool to enable them.
> Note: we distinguish between membership and ownership in the following ways:
> for membership we mean "a host in this cluster and it's state". For
> ownership, we mean "what tokens (or ranges) does each node own"; these nodes
> must already be a member to be assigned tokens.
> A rough draft sketch of how the 'add new node' workflow might look like is:
> new nodes would no longer create tokens themselves, but instead contact a
> member of a Paxos cohort (via a seed). The cohort member will generate the
> tokens and execute a LWT transaction, ensuring a linearizable change to the
> membership/ownership state. The updated state will then be disseminated via
> the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode
> and manual-mode. Auto-mode is for adding a single new node per LWT operation,
> and would require no operator intervention (much like today). In manual-mode,
> however, multiple new nodes could (somehow) signal their their intent to join
> to the cluster, but will wait until an operator executes a nodetool command
> that will trigger the token generation and LWT operation for all pending new
> nodes. This will allow us better range partitioning and will make the
> bootstrap streaming more efficient as we won't have overlapping range
> requests.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)