[
https://issues.apache.org/jira/browse/CASSANDRA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779999#action_12779999
]
Jonathan Ellis commented on CASSANDRA-562:
------------------------------------------
Right, so to make the limitations here more explicit:
- the operator has to let gossip from the previous bootstrap finish before
starting another one (this is only a few seconds -- much less stringent than
"only one node at a time")
- if you are adding M nodes to a ring of N where M > N, you can either add N
and wait for those to complete before adding the rest, or manually assign
tokens to the remaining M - N nodes, since the token autoguess won't work after
the first N (see the sketch below).
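For what it's worth, a rough sketch of the "manually assign tokens" case: with
RandomPartitioner the usual practice is to space tokens evenly over the
0..2^127 range and set each node's token explicitly. The class below is purely
illustrative, not existing code.

    import java.math.BigInteger;

    // Illustrative helper: evenly spaced RandomPartitioner tokens for a
    // target ring size, so the extra M - N nodes can be given explicit
    // initial tokens instead of relying on the autoguess.
    public class TokenSpacing
    {
        private static final BigInteger RANGE = BigInteger.valueOf(2).pow(127);

        public static BigInteger tokenFor(int index, int ringSize)
        {
            // token_i = i * 2^127 / ringSize
            return RANGE.multiply(BigInteger.valueOf(index))
                        .divide(BigInteger.valueOf(ringSize));
        }

        public static void main(String[] args)
        {
            int ringSize = 6; // e.g. adding 4 nodes to a ring of 2
            for (int i = 0; i < ringSize; i++)
                System.out.println("node " + i + ": " + tokenFor(i, ringSize));
        }
    }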
> Handle pending range clash gracefully
> -------------------------------------
>
> Key: CASSANDRA-562
> URL: https://issues.apache.org/jira/browse/CASSANDRA-562
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.5
> Reporter: Jaakko Laine
> Fix For: 0.5
>
>
> I think there are currently some problems with bootstrapping & leaving. The
> inherent problem is that a node unilaterally announces that it is going to
> leave or take a token without making sure it will not cause a conflict, and we
> do not have a proper mechanism to clean up after a conflict. There are
> currently ways a simultaneous bootstrap or leave can put the cluster
> (tokenmetadata) in an inconsistent state, so we will need either a mechanism
> to resolve which operation wins or a way to make sure only one operation is in
> progress for the affected ranges at a time.
> 1st option (resolve & clean up conflicts as they happen):
> We could add a local timestamp to bootstrap/leave gossip and resolve conflicts
> based on that. This would allow us to choose one of the operations
> unambiguously and reject all but the one that came first. Theoretically, if
> different data centers are not in perfect clock sync, this might always favor
> one DC over the other in race situations, but this would hardly create any
> noticeable bias. The problem is that this approach would probably end up being
> horrendously complex (if not impossible) to do properly.
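> Just to illustrate the kind of resolution rule this would imply (the class and
> field names below are made up for illustration, not existing code): when two
> pending operations cover overlapping ranges, the one with the smaller local
> timestamp wins and the other is rejected, with the node address as a
> tie-breaker so every node resolves the conflict the same way.
>
>     import java.net.InetAddress;
>
>     // Hypothetical representation of a pending bootstrap/leave announcement.
>     class PendingOp
>     {
>         final InetAddress node;
>         final long localTimestampMillis; // taken when the node announced the op
>
>         PendingOp(InetAddress node, long localTimestampMillis)
>         {
>             this.node = node;
>             this.localTimestampMillis = localTimestampMillis;
>         }
>
>         /** The operation that wins a conflict: the earlier announcement,
>          *  ties broken by node address so all nodes agree on the result. */
>         static PendingOp resolve(PendingOp a, PendingOp b)
>         {
>             if (a.localTimestampMillis != b.localTimestampMillis)
>                 return a.localTimestampMillis < b.localTimestampMillis ? a : b;
>             return a.node.getHostAddress().compareTo(b.node.getHostAddress()) < 0 ? a : b;
>         }
>     }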
> 2nd option (make sure only one operation is in progress at a time for the
> affected ranges):
> Add a new messaging "channel" to be used to agree beforehand who is going to
> move. (1) Before a node can start bootstrapping, it must ask permission from
> the node whose range it is going to bootstrap into. The request will be
> accepted if no other node is currently bootstrapping there. Nodes of course
> check their own token metadata before bootstrapping, but this does not guard
> against two nodes bootstrapping simultaneously (that is, before they see each
> other's gossip). The only node able to answer this reliably is the bootstrap
> source. (2) For leaving, the node should first check with all nodes that are
> going to have pending ranges whether it is OK to leave. If it receives "OK"
> from all of them, then it can leave. If a bootstrap or leave operation is
> rejected, the node will wait for a random time and start over. Eventually all
> nodes will be able to bootstrap.
> The good thing about this approach is that with a relatively small change
> (adding one message exchange before the actual move) we could make sure that
> all parties involved agree the operation is OK. This is IMHO also the
> "cleanest" way. The downside is that we would need token lease times (how long
> a node owns the "rights" to a token) and timers to make sure we do not end up
> in a deadlock (or with locked ranges) in case the node never completes the
> operation. Network partitions, delays and clock skew might create very
> difficult border cases to handle.
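> To make the lease idea concrete, something like the following could live on
> the bootstrap source (all names are illustrative, this is not a proposed
> patch): the first requester gets the "rights" to the range, and later
> requesters are rejected until the lease expires or the bootstrap completes.
>
>     import java.net.InetAddress;
>
>     // Illustrative sketch of the source-side permission logic for option 2.
>     class TokenLease
>     {
>         static final long LEASE_MILLIS = 10 * 60 * 1000L; // assumed lease time
>
>         private InetAddress holder;
>         private long grantedAt;
>
>         synchronized boolean requestPermission(InetAddress candidate, long now)
>         {
>             boolean free = holder == null || now - grantedAt > LEASE_MILLIS;
>             if (free || candidate.equals(holder))
>             {
>                 holder = candidate;
>                 grantedAt = now;
>                 return true;   // "OK, you may bootstrap into my range"
>             }
>             return false;      // someone else holds the lease: reject
>         }
>
>         synchronized void complete(InetAddress candidate)
>         {
>             if (candidate.equals(holder))
>                 holder = null; // bootstrap finished, range is free again
>         }
>     }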
> 3rd option (another approach to preventing conflicts from happening):
> (3) We might take a simplistic approach and add two new states:
> ABOUT_TO_BOOTSTRAP and ABOUT_TO_LEAVE. Whenever a node wants to bootstrap or
> leave, it will first gossip this state with token info and wait for some
> time. After the wait, it will check whether it is OK to carry on with the
> operation (that is, it has not received any bootstrap/leave gossip that
> contradicts its own plans). Also in this case, if the operation cannot be done
> due to a conflict, the node waits for a random period and starts over.
> This would be the easiest to implement, but it would not completely remove the
> problem, as network partitions and other delays might still cause two nodes to
> clash without knowledge of each other's intentions. Also, adding two more
> gossip states would expand the state machine and make it more fragile.
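> For clarity, the flow on the joining node would be roughly the following
> (the gossip and bootstrap calls are stand-ins, not real APIs, and the wait
> and backoff values are assumptions):
>
>     import java.util.Random;
>
>     // Illustrative sketch of the option 3 flow on a node that wants to join.
>     class AboutToBootstrap
>     {
>         static final long ANNOUNCE_WAIT_MILLIS = 30 * 1000L; // let gossip spread
>         static final Random RANDOM = new Random();
>
>         static void bootstrap(String token) throws InterruptedException
>         {
>             while (true)
>             {
>                 announceAboutToBootstrap(token);    // gossip ABOUT_TO_BOOTSTRAP + token
>                 Thread.sleep(ANNOUNCE_WAIT_MILLIS); // wait for conflicting intentions
>                 if (!conflictingIntentSeen(token))
>                 {
>                     startBootstrap(token);          // no one else claimed the range
>                     return;
>                 }
>                 // conflict: back off for a random period and start over
>                 Thread.sleep(RANDOM.nextInt(60 * 1000));
>             }
>         }
>
>         // stand-ins for the gossip/bootstrap integration points
>         static void announceAboutToBootstrap(String token) {}
>         static boolean conflictingIntentSeen(String token) { return false; }
>         static void startBootstrap(String token) {}
>     }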
> I don't know if I'm thinking about this in the wrong way, but to me it seems
> that resolving conflicts is very difficult, so the only option is to avoid
> them by some mechanism that makes sure only one node is moving within the
> affected ranges.