[ https://issues.apache.org/jira/browse/CASSANDRA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781055#action_12781055 ]

Stu Hood commented on CASSANDRA-562:
------------------------------------

> 2nd option (make sure only one operation is in process at one time for 
> affected ranges)
Once we're ready for the complexity, I think this is the strongest option. It's 
effectively a two-phase commit among the replicas that a bootstrapping node is 
trying to join.
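
For illustration only, a minimal sketch of that two-phase reading, with made-up 
names (Replica, prepareRangeMove, etc. are not existing Cassandra APIs): the 
joining node asks every affected replica to reserve the range first, and only 
announces the move once all of them have agreed.

import java.util.List;

// Hypothetical illustration of the two-phase idea: phase 1 reserves the range
// on every affected replica, phase 2 makes the move visible.
interface Replica {
    boolean prepareRangeMove(String token);  // phase 1: reserve, or refuse
    void commitRangeMove(String token);      // phase 2: make the move visible
    void abortRangeMove(String token);       // release any reservation held
}

final class RangeMoveCoordinator {
    boolean tryJoin(String token, List<Replica> replicas) {
        // Phase 1: every replica must agree before anything changes.
        for (Replica r : replicas) {
            if (!r.prepareRangeMove(token)) {
                for (Replica p : replicas) p.abortRangeMove(token);
                return false; // back off and retry later
            }
        }
        // Phase 2: all agreed, so the move can be announced.
        for (Replica r : replicas) r.commitRangeMove(token);
        return true;
    }
}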

> Make range changes more fully automatic
> ---------------------------------------
>
>                 Key: CASSANDRA-562
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-562
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jaakko Laine
>            Priority: Minor
>             Fix For: 0.5
>
>
> I think there are currently some problems with bootstrapping & leaving. The 
> inherent problem is that a node unilaterally announces that it is going to 
> leave/take a token without making sure this will not cause a conflict, and we 
> do not have a proper mechanism to clean up after a conflict. There are 
> currently ways in which a simultaneous bootstrap or leave can put the cluster 
> (tokenmetadata) into an inconsistent state, so we'll need either a mechanism 
> to resolve which operation wins or a way to make sure only one operation is 
> in progress for the affected ranges at a time.
> 1st option (resolve & clean conflicts as they happen):
> We could add a local timestamp to bootstrap/leave gossip and resolve conflicts 
> based on that. This would allow us to choose one of the operations 
> unambiguously and reject all but the one that was first. Theoretically, if 
> different data centers are not in perfect clock sync, this might always favor 
> one DC over the other in race situations, but this would hardly create any 
> noticeable bias. The problem is that this approach would probably end up being 
> horrendously complex (if not impossible) to do properly.
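
As a rough sketch of option 1's timestamp resolution (the class and field names 
here are made up for illustration, not existing Cassandra code): each announced 
operation would carry the local timestamp it was gossiped with, the earlier 
announcement would win, and ties would be broken deterministically so every 
node reaches the same decision.

import java.net.InetAddress;

// Illustrative only: a pending bootstrap/leave operation tagged with the
// local timestamp it was announced with.
final class RangeOperation {
    final InetAddress node;
    final long announcedAtMillis;

    RangeOperation(InetAddress node, long announcedAtMillis) {
        this.node = node;
        this.announcedAtMillis = announcedAtMillis;
    }

    // On conflict, keep the operation that was announced first; break
    // timestamp ties deterministically by node address so that every
    // replica resolves the race the same way.
    static RangeOperation resolve(RangeOperation a, RangeOperation b) {
        if (a.announcedAtMillis != b.announcedAtMillis)
            return a.announcedAtMillis < b.announcedAtMillis ? a : b;
        return a.node.getHostAddress().compareTo(b.node.getHostAddress()) <= 0 ? a : b;
    }
}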
> 2nd option (make sure only one operation is in process at one time for 
> affected ranges):
> Add a new messaging "channel" to be used to agree beforehand who is going to 
> move. (1) Before a node can start bootstrapping, it must ask permission from 
> the node whose range it is going to bootstrap into. The request will be 
> accepted if no other node is currently bootstrapping there. Nodes of course 
> check their own token metadata before bootstrapping, but this does not guard 
> against two nodes bootstrapping simultaneously (that is, before they see each 
> other's gossip). The only node able to answer this reliably is the bootstrap 
> source. (2) For leaving, the node should first check with all nodes that are 
> going to have pending ranges whether it is OK to leave. If it receives "OK" 
> from all of them, it can leave. If a bootstrap or leave operation is rejected, 
> the node will wait for a random time and start over. Eventually all nodes 
> will be able to bootstrap.
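
A minimal sketch of how the permission exchange could look on the bootstrap 
source, under assumed names (BootstrapSource and friends are illustrative, not 
existing Cassandra messaging code): the source grants its range to at most one 
joiner at a time, and a rejected node backs off for a random interval before 
retrying.

import java.net.InetAddress;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative bootstrap source: grants the range to at most one joiner.
final class BootstrapSource {
    private InetAddress currentJoiner; // null when nobody is bootstrapping here

    synchronized boolean requestPermission(InetAddress joiner) {
        if (currentJoiner == null || currentJoiner.equals(joiner)) {
            currentJoiner = joiner;   // grant: range is reserved for this node
            return true;
        }
        return false;                 // reject: someone else is already moving in
    }

    synchronized void complete(InetAddress joiner) {
        if (joiner.equals(currentJoiner))
            currentJoiner = null;     // bootstrap finished, range is free again
    }
}

// On the joining side: wait a random interval and start over on rejection,
// so that competing joiners eventually spread out.
final class JoinRetry {
    static long nextBackoffMillis() {
        return ThreadLocalRandom.current().nextLong(5_000, 60_000);
    }
}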
> The good thing about this approach is that with a relatively small change 
> (adding one message exchange before the actual move) we could make sure that 
> all parties involved agree that the operation is OK. This is IMHO also the 
> "cleanest" way. The downside is that we will need token lease times (how long 
> a node owns the "rights" to the token) and timers to make sure that we do not 
> end up in a deadlock (or lock ranges) in case the node does not complete the 
> operation. Network partitions, delays and clock skew might create very 
> difficult border cases to handle.
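
One possible shape for the lease/timer part, again with made-up names: the 
granted permission carries an expiry time so that a node that dies or gets 
partitioned mid-operation cannot lock the range forever. Clock skew between 
nodes makes the expiry only approximate, which is part of the border-case 
difficulty mentioned above.

// Illustrative token lease: the permission granted to a joiner expires
// automatically, so a crashed or partitioned node cannot hold the range
// indefinitely.
final class TokenLease {
    final String token;
    final long expiresAtMillis;

    TokenLease(String token, long durationMillis) {
        this.token = token;
        this.expiresAtMillis = System.currentTimeMillis() + durationMillis;
    }

    boolean expired() {
        return System.currentTimeMillis() >= expiresAtMillis;
    }
}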
> 3rd option (another approach to preventing conflicts from happening):
> (3) We might take a simplistic approach and add two new states: 
> ABOUT_TO_BOOTSTRAP and ABOUT_TO_LEAVE. Whenever a node wants to bootstrap or 
> leave, it will first gossip this state with token info and wait for some 
> time. After the wait, it will check whether it is OK to carry on with the 
> operation (that is, it has not received any bootstrap/leave gossip that would 
> contradict its own plans). Also in this case, if the operation cannot be done 
> due to a conflict, the node waits for a random period and starts over.
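
Sketched with assumed names (the gossiper interface below is hypothetical, not 
an existing Cassandra API), the wait-then-check for option 3 could look roughly 
like this: announce the intent, let the gossip spread, and proceed only if no 
conflicting announcement has been seen; otherwise back off for a random period 
and start over.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Illustrative wait-then-check: announce intent, wait, and only proceed if no
// conflicting announcement was seen in the meantime.
final class AboutToBootstrap {
    interface Gossiper {
        void announceAboutToBootstrap(String token);
        boolean sawConflictingAnnouncement(String token);
    }

    static boolean tryAnnounce(Gossiper gossiper, String token) throws InterruptedException {
        gossiper.announceAboutToBootstrap(token);
        TimeUnit.SECONDS.sleep(30);                       // let the gossip spread
        if (!gossiper.sawConflictingAnnouncement(token))
            return true;                                  // safe to start bootstrapping
        // Conflict: back off for a random period and start over later.
        TimeUnit.MILLISECONDS.sleep(ThreadLocalRandom.current().nextLong(5_000, 60_000));
        return false;
    }
}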
> This would be the easiest to implement, but it would not completely remove 
> the problem, as network partitions and other delays might still cause two 
> nodes to clash without knowledge of each other's intentions. Also, adding two 
> more gossip states will expand the state machine and make it more fragile.
> I don't know if I'm thinking about this the wrong way, but to me it seems 
> that resolving conflicts is very difficult, so the only option is to avoid 
> them by some mechanism that makes sure only one node is moving within the 
> affected ranges.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
