[jira] [Commented] (CASSANDRA-957) convenience workflow for replacing dead node

Vijay (JIRA) Thu, 25 Aug 2011 15:32:59 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091376#comment-13091376
 ]


Vijay commented on CASSANDRA-957:
---------------------------------

* In Gossiper.doStatusCheck() you made it ignore any state that is for the 
local endpoint and is not a dead state. Shouldn't it just always ignore any 
state about the local endpoint though? Basically what it was doing previously?
* Basically the same question about Gossiper.applyStateLocally(): the loop 
continues if the state is for the local node and the state is dead. Why would 
we want to apply a live local state?
-> Fixed. The initial intention was to find the old state of the node, but it 
seems that is not possible now.

* Does the hibernate state need the true/false value? Seems like all we care 
about is that it is set at all. It looks like when we start up right now we 
automatically go into a hibernate state, then go into a bootstrap state 
afterwards if the user specified a replace token. Seems like we shouldn't set 
a state at all until we know we are doing one of replace/bootstrap/just joining.
-> It will be either true or false (if not a replace, or overwritten with the 
normal state). If you don't set it, then Gossiper.applyStateLocally will mark 
the node alive on all the other nodes.

* It looks like right now you could specify a replace token that isn't part of 
the cluster. If that happens we should throw an exception and tell the user to 
do the normal bootstrap process.
-> As we are ignoring the local states, this information is hard to gather when 
we are trying to replace the same node. The check is to see that no other live 
node owns this token.
-> We can document in the wiki the effects of replacing a token which is not 
part of the ring (repair/decommission).
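
A minimal sketch of the check described in the reply above, assuming a 
simplified in-memory view of the ring. The class name, the string-based token 
and endpoint types, and the two collections are hypothetical stand-ins, not 
the structures used in the patch. A token that is absent from the ring is 
allowed through, matching the reply that this case is documented rather than 
rejected.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; these names do not come from the Cassandra code base. */
final class ReplaceTokenValidator {
    private ReplaceTokenValidator() {}

    /**
     * Rejects the replace token only if a live node other than the local one owns it.
     *
     * @param replaceToken    token the starting node wants to take over
     * @param localEndpoint   address of the starting node
     * @param tokenToEndpoint assumed view of token ownership in the ring
     * @param liveEndpoints   assumed set of endpoints currently considered alive
     */
    static void validate(String replaceToken,
                         String localEndpoint,
                         Map<String, String> tokenToEndpoint,
                         Set<String> liveEndpoints) {
        String owner = tokenToEndpoint.get(replaceToken);
        if (owner == null) {
            return; // token unknown to the ring: not rejected here, only documented
        }
        if (owner.equals(localEndpoint)) {
            return; // replacing the same node: local state is ignored
        }
        if (liveEndpoints.contains(owner)) {
            throw new IllegalStateException(
                "Token " + replaceToken + " is owned by live node " + owner);
        }
    }
}
```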

* Why use the last gossip time to determine if the node we are replacing is 
alive? Why not just check gossip to see if the ring thinks it is alive?
-> Because by default when we hear about someone we consider them to be alive. 
The idea is to check whether we have heard back from them or not (after the 
ring delay); if not, it is more probable that the dead node really is dead. 
(That's why we have to wait for 90 + delay.)
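
The idea in the reply above can be sketched roughly as follows. This is an 
illustration only: the class, its methods, and the millisecond timestamps are 
assumptions made for the example and are not the Gossiper API. The wait 
length is passed in by the caller rather than fixed here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not Cassandra's Gossiper. */
final class ReplaceLivenessCheck {
    // endpoint -> time (ms) of the last gossip update received about it
    private final Map<String, Long> lastUpdateMillis = new ConcurrentHashMap<>();

    /** Records that gossip about the endpoint arrived at the given time. */
    void recordGossip(String endpoint, long nowMillis) {
        lastUpdateMillis.put(endpoint, nowMillis);
    }

    /**
     * Returns true only when the whole wait period has elapsed and nothing
     * has been heard from the endpoint since the wait started.
     */
    boolean looksDead(String endpoint, long waitStartedMillis,
                      long waitMillis, long nowMillis) {
        if (nowMillis - waitStartedMillis < waitMillis) {
            return false; // still inside the wait window; too early to decide
        }
        Long last = lastUpdateMillis.get(endpoint);
        return last == null || last < waitStartedMillis;
    }
}
```

Merely having an entry for the endpoint is not treated as proof of life here, 
which mirrors the point that a node is considered alive by default as soon as 
it is heard about.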

* We should update the message for the exception that is thrown when you 
try to bootstrap to an existing token. It should indicate either remove the 
dead node or follow this replacement process.
-> I am not sure I parse that; I have added more to it, please check.

* I'm not sure why we are calling updateNormalToken() in the 
StorageService.bootstrap() method when it's a token replacement.
-> That's because you don't want the range request sent to a node which does 
not exist.

* A little bit of doc on this would be good, maybe in cassandra.yaml? Just on 
how to pass the argument to the startup process.
-> YAML is a bad fit because this is a one-time thing. A wiki page, like the 
one for the don't-join-ring property?

> convenience workflow for replacing dead node
> --------------------------------------------
>
>                 Key: CASSANDRA-957
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-957
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Core, Tools
>    Affects Versions: 0.8.2
>            Reporter: Jonathan Ellis
>            Assignee: Vijay
>             Fix For: 1.0
>
>         Attachments: 0001-Support-Token-Replace.patch, 
> 0001-Support-bringing-back-a-node-to-the-cluster-that-exi.patch, 
> 0001-Support-token-replace.patch, 0001-support-for-replace-token-v3.patch, 
> 0002-Do-not-include-local-node-when-computing-workMap.patch, 
> 0002-Rework-Hints-to-be-on-token.patch, 
> 0002-Rework-Hints-to-be-on-token.patch, 
> 0002-upport-for-hints-on-token-v3.patch, 
> 0003-Make-HintedHandoff-More-reliable.patch, 
> 0003-Make-hints-More-reliable.patch, 0003-making-bootstrap-sleep-longer.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Replacing a dead node with a new one is a common operation, but "nodetool 
> removetoken" followed by bootstrap is inefficient (re-replicating data first 
> to the remaining nodes, then to the new one) and manually bootstrapping to a 
> token "just less than" the old one's, followed by "nodetool removetoken" is 
> slightly painful and prone to manual errors.
> First question: how would you expose this in our tool ecosystem?  It needs to 
> be a startup-time option to the new node, so it can't be nodetool, and 
> messing with the config xml definitely takes the "convenience" out.  A 
> one-off -DreplaceToken=XXY argument?
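
For illustration, a one-off startup argument like the one floated in the 
issue description could be read as a JVM system property. The sketch below 
uses the property name replaceToken exactly as written in the description; 
the class and method names are hypothetical, and the name used by the 
attached patches may differ.

```java
import java.util.Optional;

/** Hypothetical sketch of reading a one-off startup option. */
final class ReplaceTokenOption {
    // Property name taken from the issue description (-DreplaceToken=...).
    static final String PROPERTY = "replaceToken";

    private ReplaceTokenOption() {}

    /** Returns the trimmed token if the property was passed at startup. */
    static Optional<String> read() {
        String raw = System.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty(); // not a replace: follow the normal startup path
        }
        return Optional.of(raw.trim());
    }
}
```

Because the value arrives on the command line, it does not persist across 
restarts, which fits an operation that is performed only once.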

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
