[ https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-9667:
-------------------------------------
    Assignee:     (was: Joel Knighton)

> strongly consistent membership and ownership
> --------------------------------------------
>
>                 Key: CASSANDRA-9667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jason Brown
>              Labels: LWT, membership, ownership
>             Fix For: 3.x
>
>
> Currently, there is advice to users to "wait two minutes between adding new
> nodes" in order for new node tokens, et al, to propagate. Further, as there's
> no coordination amongst joining nodes wrt token selection, new nodes can end
> up selecting ranges that overlap with other joining nodes. This causes a lot
> of duplicate streaming from the existing source nodes as they shovel out the
> bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent
> membership and ownership changes in Cassandra such that changes are performed
> in a linearizable and safe manner. The basic idea is to use LWT operations
> over a global system table, and leverage the linearizability of LWT for
> ensuring the safety of cluster membership/ownership state changes. This work
> is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and
> range move (there may be others I'm not thinking of) will need to be modified
> to participate in this scheme, and nodetool will need changes to enable them.
> Note: we distinguish between membership and ownership in the following ways:
> for membership we mean "a host in this cluster and its state". For
> ownership, we mean "what tokens (or ranges) does each node own"; a node
> must already be a member to be assigned tokens.
> A rough sketch of how the 'add new node' workflow might look:
> new nodes would no longer create tokens themselves, but instead contact a
> member of a Paxos cohort (via a seed).
> The cohort member will generate the
> tokens and execute an LWT, ensuring a linearizable change to the
> membership/ownership state. The updated state will then be disseminated via
> the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode
> and manual-mode. Auto-mode is for adding a single new node per LWT operation,
> and would require no operator intervention (much like today). In manual-mode,
> however, multiple new nodes could (somehow) signal their intent to join
> the cluster, but will wait until an operator executes a nodetool command
> that will trigger the token generation and LWT operation for all pending new
> nodes. This will allow better range partitioning and will make the
> bootstrap streaming more efficient as we won't have overlapping range
> requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
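
The 'add new node' workflow described in the ticket can be sketched as a read, propose, compare-and-set loop. What follows is only an illustration under stated assumptions, not Cassandra code: `StateTable`, `ClusterState`, `pick_tokens`, `join`, and `TOKEN_SPACE` are hypothetical names, an in-process lock stands in for the Paxos round behind an LWT, and tokens are picked at random rather than by a balancing allocator.

```python
import random
import threading
from dataclasses import dataclass

TOKEN_SPACE = 2**64  # hypothetical ring size, chosen only for the sketch


@dataclass(frozen=True)
class ClusterState:
    epoch: int = 0      # bumped on every accepted change
    owners: tuple = ()  # sorted (token, node) pairs


class StateTable:
    """Stand-in for the global system table; the lock plays the role of Paxos."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = ClusterState()

    def read(self):
        with self._lock:
            return self._state

    def compare_and_set(self, expected_epoch, new_state):
        """Apply new_state only if the table is still at expected_epoch."""
        with self._lock:
            if self._state.epoch != expected_epoch:
                return False
            self._state = new_state
            return True


def pick_tokens(state, count, rng):
    """Choose `count` tokens that are not already claimed in `state`."""
    taken = {token for token, _ in state.owners}
    chosen = set()
    while len(chosen) < count:
        candidate = rng.randrange(TOKEN_SPACE)
        if candidate not in taken:
            chosen.add(candidate)
    return sorted(chosen)


def join(table, node, num_tokens=4, rng=None):
    """Cohort-member side of a join: read, propose, retry on conflict."""
    rng = rng or random.Random()
    while True:
        current = table.read()
        tokens = pick_tokens(current, num_tokens, rng)
        owners = tuple(sorted(current.owners + tuple((t, node) for t in tokens)))
        proposed = ClusterState(epoch=current.epoch + 1, owners=owners)
        if table.compare_and_set(current.epoch, proposed):
            return tokens
        # Another join was accepted first; re-read the state and try again.


def run_demo(node_count=8):
    """Let several nodes join concurrently and return the final state."""
    table = StateTable()
    threads = [
        threading.Thread(target=join, args=(table, f"node{i}"))
        for i in range(node_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.read()
```

Because every proposal is validated against the epoch it was computed from, a join that raced with another one is rejected and recomputed against the newer state, so no two nodes can end up claiming the same token. Dissemination of the accepted state via gossip is outside the scope of this sketch.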