Handle pending range clash gracefully
-------------------------------------

                 Key: CASSANDRA-562
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-562
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.5
            Reporter: Jaakko Laine
             Fix For: 0.5


I think there are currently some problems with bootstrapping & leaving. The 
inherent problem is that a node one-sidedly announces that it is going to 
leave/take a token without making sure it will not cause a conflict, and we do 
not have a proper mechanism to clean up after a conflict. There are currently 
ways a simultaneous bootstrap or leave can put the cluster (tokenmetadata) 
in an inconsistent state, so we'll need either a mechanism to resolve which 
operation wins or a way to make sure only one operation is in progress for the 
affected ranges at any one time.

1st option (resolve & clean conflicts as they happen):

We could add a local timestamp to bootstrap/leave gossip and resolve conflicts 
based on that. This would allow us to choose one of the operations 
unambiguously and reject all but the one that was first. Theoretically, if 
different data centers are not in perfect clock sync, this might always favor 
one DC over the other in race situations, but this would hardly create any 
noticeable bias. The problem is, this approach would probably end up being 
horrendously complex (if not impossible) to do properly.
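
To make the selection step concrete, here is a minimal sketch (Python for 
brevity, not Cassandra code). The MoveAnnouncement type, its fields and the 
tie-break on node name are assumptions for illustration only. It covers just 
picking a winner, not cleaning up after the rejected operations, which is the 
hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveAnnouncement:
    # Hypothetical gossip payload; none of these names come from Cassandra.
    node: str         # endpoint announcing the bootstrap/leave
    token: int        # token being taken or given up
    timestamp: float  # announcer's local clock when it announced


def resolve(announcements):
    """Return (winner, rejected) for announcements that clash on one range."""
    # Earliest timestamp wins. The node name breaks exact ties (an assumption)
    # so every node seeing the same announcements picks the same winner.
    ordered = sorted(announcements, key=lambda a: (a.timestamp, a.node))
    return ordered[0], ordered[1:]


a = MoveAnnouncement("10.0.0.1", 100, 1000.0)
b = MoveAnnouncement("10.0.0.2", 100, 1000.5)
winner, rejected = resolve([b, a])  # arrival order does not matter
```

Every node that sees both announcements reaches the same verdict regardless 
of the order in which gossip delivered them.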

2nd option (make sure only one operation is in process at one time for affected 
ranges):

Add a new messaging "channel" to be used to agree beforehand who is going to 
move. (1) Before a node can start bootstrapping, it must ask permission from 
the node whose range it is going to bootstrap to. The request will be accepted 
if no other node is currently bootstrapping there. Nodes of course check their 
own token metadata before bootstrapping, but this does not guard against two 
nodes bootstrapping simultaneously (that is, before they see each other's 
gossip). The only node able to answer this reliably is the bootstrap source. 
(2) For leaving, the node should first check with all nodes that are going to 
have pending ranges if it is OK to leave. If it receives "OK" from all of them, 
then it can leave. If a bootstrap or leave operation is rejected, the node will 
wait for a random time and start over. Eventually all nodes will be able to 
bootstrap.
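
A rough sketch of that exchange, in Python for brevity. RangeGuard, 
request_leave and retry_delay are hypothetical names, the network messages 
are reduced to plain method calls, and handing back partial grants when a 
leave is rejected is my own assumption.

```python
import random


class RangeGuard:
    """Hypothetical gatekeeper on the node that owns the affected range."""

    def __init__(self):
        self.holder = None  # node currently permitted to move in this range

    def request(self, node):
        # Accept only if no other node is already moving within this range.
        if self.holder is None or self.holder == node:
            self.holder = node
            return True
        return False

    def release(self, node):
        # Called when the move completes or is abandoned.
        if self.holder == node:
            self.holder = None


def request_leave(node, guards):
    """Leaving needs an OK from every node that would get a pending range."""
    granted = []
    for guard in guards:
        if not guard.request(node):
            # Assumption: hand back partial grants so nothing stays locked.
            for g in granted:
                g.release(node)
            return False
        granted.append(guard)
    return True


def retry_delay(max_wait, rng=random):
    # Random wait before starting over, so rejected nodes do not retry in lockstep.
    return rng.uniform(0.0, max_wait)


source = RangeGuard()
first = source.request("node-A")   # accepted, the range was free
second = source.request("node-B")  # rejected, node-A is bootstrapping there
source.release("node-A")
third = source.request("node-B")   # accepted once node-A has finished
```

Nothing here releases the permission if node-A dies mid-operation, so a 
crashed node would keep the range locked.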

The good thing about this approach is that with a relatively small change 
(adding one messaging exchange before the actual move) we could make sure that 
all parties involved agree that the operation is OK. This is IMHO also the 
"cleanest" way. The downside is that we will need token lease times (how long 
the node owns "rights" to the token) and timers to make sure that we do not end 
up in a deadlock (or lock ranges) in case the node does not complete the 
operation. Network partitions, delays and clock skews might create very 
difficult edge cases to handle.
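
The lease idea could look roughly like this (Python sketch, hypothetical 
TokenLease class). The current time is passed in explicitly to keep the 
example deterministic; a real version would read a clock on the granting 
node.

```python
class TokenLease:
    """Hypothetical time-limited right to move within one range."""

    def __init__(self, duration):
        self.duration = duration  # lease length in seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        # Reject while another node holds a lease that has not yet expired.
        held_by_other = self.holder is not None and self.holder != node
        if held_by_other and now < self.expires_at:
            return False
        # Free, expired, or a renewal by the current holder: (re)grant it.
        self.holder = node
        self.expires_at = now + self.duration
        return True


lease = TokenLease(duration=30.0)
granted_a = lease.acquire("node-A", now=0.0)    # range was free
rejected_b = lease.acquire("node-B", now=10.0)  # node-A's lease still valid
granted_b = lease.acquire("node-B", now=31.0)   # node-A's lease has expired
```

In this sketch expiry is judged only by the granting node's clock. The holder 
still has to notice that its lease ran out and stop, and that is where the 
delays and clock skews mentioned above come in.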

3rd option (another approach to preventing conflicts from happening):

(3) We might take a simplistic approach and add two new states: 
ABOUT_TO_BOOTSTRAP and ABOUT_TO_LEAVE. Whenever a node wants to bootstrap or 
leave, it will first gossip this state with token info and wait for some time. 
After the wait, it will check if it is OK to carry on with the operation (that 
is, it has not received any bootstrap/leaving gossip that would contradict 
its own plans). Also in this case, if the operation cannot be done due to a 
conflict, the node waits for a random period and starts over.
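
The check after the wait could be sketched like this (Python, hypothetical 
names). The conflict rule used here, "same token", is only a stand-in 
assumption; a real rule would compare the ranges affected by each operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    # Hypothetical record of one gossiped ABOUT_TO_* announcement.
    node: str
    state: str  # "ABOUT_TO_BOOTSTRAP" or "ABOUT_TO_LEAVE"
    token: int


def same_token(a, b):
    # Stand-in conflict rule (an assumption): two intents clash on equal tokens.
    return a.token == b.token


def may_proceed(own, heard, conflicts=same_token):
    """After the wait: carry on only if no gossip heard contradicts our plan."""
    return not any(
        other.node != own.node and conflicts(own, other) for other in heard
    )


mine = Intent("node-A", "ABOUT_TO_BOOTSTRAP", 100)
rival = Intent("node-B", "ABOUT_TO_BOOTSTRAP", 100)
unrelated = Intent("node-C", "ABOUT_TO_LEAVE", 500)

ok_alone = may_proceed(mine, [mine, unrelated])          # no clash heard
ok_with_rival = may_proceed(mine, [mine, rival, unrelated])  # clash on token 100
```

If both clashing nodes have heard each other, both back off, and the random 
wait decides who goes first on the next attempt.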

This would be the easiest to implement, but it would not completely remove the 
problem, as network partitions and other delays might still cause two nodes to 
clash without knowledge of each other's intentions. Also, adding two more 
gossip states will expand the state machine and make it more fragile.


Don't know if I'm thinking about this the wrong way, but to me it seems 
resolving conflicts is very difficult, so the only option is to avoid them by 
some mechanism that makes sure only one node is moving within affected ranges.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
