[
https://issues.apache.org/jira/browse/CASSANDRA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779995#action_12779995
]
Jaakko Laine commented on CASSANDRA-562:
----------------------------------------
Might be that I'm thinking about this in too complex a way, but (currently) if
two nodes boot simultaneously, there are at least the following things that
could go wrong:
(1) Both bootstrapping nodes get a bootstrap token from the same most loaded
node (they have not yet received each other's gossip, so each thinks nobody
else is booting there). Most likely these tokens will be identical, as the
bootstrap source does not consider pending ranges.
(2) Both booting nodes get the token and gossip that they are going to boot to
this token. All other nodes in the cluster will update their pending ranges
according to one of them, depending on whose gossip they got first.
(3) Token metadata might be left in an inconsistent state, as addPendingRange
is not atomic. The clash might be caused by, say, the third range to be added,
in which case the 1st and 2nd ranges will be left in place even though they
are not supposed to be.
(4) Since the bootstrapping nodes do not update pending ranges for themselves,
they will not notice the range clash and will happily continue to assume the
token they were given. In this case both nodes will think that they own the
token, and token metadata on the other nodes will list either one of these
nodes as the token owner, depending on whose gossip they heard first.
Not all of these are related to simultaneous bootstrap; I have just listed all
the points I recalled regarding bootstrap. Many of them can be taken care of
with some care even without any "new" mechanism before bootstrapping, but does
cleaning up solve all issues?
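To illustrate point (3), here is a minimal sketch (in Python, with invented
names; Cassandra's actual addPendingRange differs) of what an all-or-nothing
pending-range update could look like: the whole batch is validated against
existing pending ranges before anything is mutated, so a clash on the third
range cannot leave the first two behind.

```python
def ranges_overlap(a, b):
    """Two half-open token ranges (start, end) overlap if each starts
    before the other ends. Wrap-around is ignored for simplicity."""
    return a[0] < b[1] and b[0] < a[1]

def add_pending_ranges(pending, endpoint, new_ranges):
    """Atomically add new_ranges for endpoint, or reject the whole batch.

    pending: dict mapping endpoint -> list of (start, end) ranges.
    Returns True if all ranges were added, False if any clashed."""
    existing = [r for rs in pending.values() for r in rs]
    for nr in new_ranges:
        if any(ranges_overlap(nr, er) for er in existing):
            return False  # reject everything; nothing has been mutated yet
    pending.setdefault(endpoint, []).extend(new_ranges)
    return True
```

With this shape, a batch whose third range clashes leaves the map untouched
instead of half-applied.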
What if nodes boot1 and boot2 boot from source1 simultaneously? source1 needs
to either reject one of them or give out different tokens. The first case could
easily be handled by getBootstrapToken (just return a "null token", which would
signal the booting node to try again after some time). The second case would
complicate things on the pending-range side, as we would need to calculate
pending ranges relative to other pending ranges, not only based on tokens. On
the other hand, if overly large pending ranges are not a problem (that is,
streaming data to a destination that is not going to need it), then I think
things are OK.
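The "null token" rejection could be sketched as follows (a toy model, not
Cassandra's getBootstrapToken; class and method names are invented): the source
remembers who it has granted a token to, and a second concurrent requester gets
None, telling it to back off and retry.

```python
class BootstrapSource:
    """Toy model of a bootstrap source that grants one token at a time."""

    def __init__(self, pick_token):
        self.pick_token = pick_token  # callable choosing a token to hand out
        self.granted_to = None        # node currently holding a grant

    def get_bootstrap_token(self, requester):
        if self.granted_to is not None and self.granted_to != requester:
            return None  # "null token": another node is already booting here
        self.granted_to = requester
        return self.pick_token()

    def bootstrap_complete(self, requester):
        if self.granted_to == requester:
            self.granted_to = None
```

Here boot2 would receive None while boot1's bootstrap is in flight, and would
retry after a random delay.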
> Handle pending range clash gracefully
> -------------------------------------
>
> Key: CASSANDRA-562
> URL: https://issues.apache.org/jira/browse/CASSANDRA-562
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.5
> Reporter: Jaakko Laine
> Fix For: 0.5
>
>
> I think there are currently some problems with bootstrapping & leaving. The
> inherent problem is that a node unilaterally announces that it is going to
> leave/take a token without making sure this will not cause a conflict, and we
> do not have a proper mechanism to clean up after a conflict. There are
> currently ways a simultaneous bootstrap or leave can leave the cluster
> (tokenmetadata) in an inconsistent state, so we'll need either a mechanism to
> resolve which operation wins, or to make sure only one operation is in
> progress for the affected ranges at a time.
> 1st option (resolve & clean conflicts as they happen):
> We could add a local timestamp to bootstrap/leave gossip and resolve
> conflicts based on that. This would allow us to choose one of the operations
> unambiguously and reject all but the one that was first. Theoretically, if
> different data centers are not in perfect clock sync, this might always favor
> one DC over the other in race situations, but this would hardly create any
> noticeable bias. The problem is that this approach would probably end up
> being horrendously complex (if not impossible) to do properly.
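The core of the timestamp tie-break is simple (the hard part, as noted above,
is everything around it). A sketch, assuming each announcement carries a
(local timestamp, node id) pair: comparing the pairs lexicographically picks
the same winner on every node, with the node id breaking exact timestamp ties.

```python
def resolve_conflict(op_a, op_b):
    """Each op is (timestamp_millis, node_id). Returns the winning op:
    earliest timestamp wins; equal timestamps fall back to the lower id,
    so all nodes resolve the race identically."""
    return min(op_a, op_b)
```

This decides the winner deterministically, but rolling back the losing
operation's side effects is where the complexity would live.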
> 2nd option (make sure only one operation is in progress at a time for the
> affected ranges):
> Add a new messaging "channel" to be used to agree beforehand who is going to
> move. (1) Before a node can start bootstrapping, it must ask permission from
> the node whose range it is going to bootstrap into. The request will be
> accepted if no other node is currently bootstrapping there. Nodes of course
> check their own token metadata before bootstrapping, but this does not guard
> against two nodes bootstrapping simultaneously (that is, before they see each
> other's gossip). The only node able to answer this reliably is the bootstrap
> source. (2) For leaving, the node should first check with all nodes that are
> going to have pending ranges whether it is OK to leave. If it receives "OK"
> from all of them, then it can leave. If a bootstrapping or leaving operation
> is rejected, the node will wait for a random time and start over. Eventually
> all nodes will be able to bootstrap.
> The good thing about this approach is that with a relatively small change
> (adding one message exchange before the actual move) we could make sure that
> all parties involved agree that the operation is OK. This is IMHO also the
> "cleanest" way. The downside is that we will need token lease times (how long
> a node owns the "rights" to a token) and timers to make sure that we do not
> end up in a deadlock (or lock ranges) in case a node does not complete the
> operation. Network partitions, delays and clock skews might create very
> difficult border cases to handle.
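The lease-timer idea from the 2nd option could be sketched like this (invented
names, not a real Cassandra API): the source grants a time-limited lease on the
range, and if the holder never completes, the lease simply expires, so a
crashed bootstrapper cannot lock the range forever.

```python
import time

class RangeLease:
    """Toy lease on a token range: one holder at a time, with expiry."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock  # injectable for testing
        self.holder = None
        self.expires_at = 0.0

    def request(self, node):
        now = self.clock()
        if (self.holder is not None and now < self.expires_at
                and self.holder != node):
            return False  # someone else holds an unexpired lease: rejected
        self.holder = node
        self.expires_at = now + self.lease_seconds
        return True

    def complete(self, node):
        if self.holder == node:
            self.holder = None
```

The awkward cases the comment mentions (partitions, clock skew) show up here
as the choice of lease length and clock source.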
> 3rd option (another approach to preventing conflicts from happening):
> (3) We might take a simplistic approach and add two new states:
> ABOUT_TO_BOOTSTRAP and ABOUT_TO_LEAVE. Whenever a node wants to bootstrap or
> leave, it will first gossip this state with token info and wait for some
> time. After the wait, it will check whether it is OK to carry on with the
> operation (that is, it has not received any bootstrap/leave gossip that
> contradicts its own plans). Also in this case, if the operation cannot be
> done due to a conflict, the node waits for a random period and starts over.
> This would be the easiest to implement, but it would not completely remove
> the problem, as network partitions and other delays might still cause two
> nodes to clash without knowledge of each other's intentions. Also, adding two
> more gossip states will expand the state machine and make it more fragile.
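The post-wait check in the 3rd option could look roughly like this (a sketch
with assumed names; the intent list stands in for what the node has heard via
gossip): the node proceeds only if no other node announced an earlier intent
on the same token, with node ids breaking timestamp ties.

```python
def may_proceed(my_id, my_token, heard_intents):
    """heard_intents: list of (node_id, token, timestamp) gossip entries,
    including this node's own. Returns True if this node may carry on
    with its bootstrap/leave after the waiting period."""
    mine = next((i for i in heard_intents if i[0] == my_id), None)
    assert mine is not None, "node must have gossiped its own intent"
    for node_id, token, ts in heard_intents:
        if node_id == my_id or token != my_token:
            continue
        # conflict on the same token: yield to the earlier announcement,
        # breaking exact timestamp ties by node id
        if (ts, node_id) < (mine[2], my_id):
            return False
    return True
```

As the comment notes, this only narrows the race window: an intent delayed by
a partition may simply not be in heard_intents when the check runs.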
> I don't know if I'm thinking about this in the wrong way, but to me it seems
> that resolving conflicts is very difficult, so the only option is to avoid
> them by some mechanism that makes sure only one node is moving within the
> affected ranges.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.