[
https://issues.apache.org/jira/browse/CASSANDRA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091376#comment-13091376
]
Vijay commented on CASSANDRA-957:
---------------------------------
* In Gossiper.doStatusCheck() you made it ignore any state that is for the
local endpoint and is not a dead state. Shouldn't it just always ignore any
state about the local endpoint though? Basically what it was doing previously?
* Basically the same question about Gossiper.applyStateLocally(): the loop
continues if the state is for the local node and the state is dead. Why would
we want to apply a live local state?
-> Fixed. The initial intention was to find the old state of the node, but it
seems that is not possible now.
* Does the hibernate state need the true/false value? Seems like all we care
about is that it is set at all. Looks like when we are starting up right now we
automatically go into a hibernate state, then we go into a bootstrap state
afterwards if the user specified a replace token. Seems like we shouldn't set a
state at all until we know we are doing one of replace/bootstrap/just joining.
-> It will be either true or false (if not a replace, or overwritten with the
normal state). If you don't set it, then Gossiper.applyStateLocally will mark
the node alive on all the other nodes.
* It looks like right now you could specify a replace token that isn't part of
the cluster. If that happens we should throw an exception and tell the user to
do the normal bootstrap process.
-> As we are ignoring the local states, this information is hard to gather when
we are trying to replace the same node. The check is to see that no other live
node owns this token.
-> We can document in the wiki the effects of replacing a token which is not
part of the ring (repair/decommission).
* Why use the last gossip time to determine if the node we are replacing is
alive? Why not just check gossip to see if the ring thinks it is alive?
-> Because by default when we hear about someone we consider them to be alive.
The idea is to check whether we heard back from them or not (after the ring
delay); if not, then it is more probable that the dead node is really dead.
(That's why we have to wait for 90 + delay.)
* We should update the message for the exception that is thrown when you
try to bootstrap to an existing token. It should indicate either remove the
dead node or follow this replacement process.
-> I am not sure I parse that. I have added more to it, please check.
* I'm not sure why we are calling updateNormalToken() in the
StorageService.bootstrap() method when it's a token replacement.
-> That's because you don't want the range requests sent to a node which does
not exist.
* A little bit of doc on this would be good, maybe in cassandra.yaml? Just on
how to pass the argument to the startup process.
-> Yaml is bad because this is a one-time thing. Wiki page? Like the don't
join ring property.
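
To make the "last gossip time" answer above concrete, here is a rough
standalone sketch of the idea: wait out the ring delay, then treat the node
being replaced as dead only if nothing was heard back from it during the wait.
This is not code from the patch. The class name, method name and parameters
are hypothetical, and the 30 second delay in main() is only an illustrative
value.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Cassandra): judges whether a node being replaced looks dead. */
final class ReplaceLivenessCheck {
    private ReplaceLivenessCheck() {}

    /**
     * @param lastHeardMillis time of the last update received from the node (assumed to be tracked by the caller)
     * @param waitStartMillis time at which the replacing node started waiting
     * @param nowMillis       current time
     * @param ringDelayMillis how long we must wait before judging liveness
     * @return true if nothing was heard from the node since the wait began
     */
    static boolean looksDead(long lastHeardMillis, long waitStartMillis,
                             long nowMillis, long ringDelayMillis) {
        if (nowMillis - waitStartMillis < ringDelayMillis) {
            // Merely hearing *about* a node makes it look alive, so judging early is unsafe.
            throw new IllegalStateException("ring delay has not elapsed yet");
        }
        // Any update received after the wait began means the node talked back.
        return lastHeardMillis < waitStartMillis;
    }

    public static void main(String[] args) {
        long start = 0L;
        long delay = TimeUnit.SECONDS.toMillis(30); // illustrative value only
        long now = start + delay;
        System.out.println(looksDead(-1_000L, start, now, delay)); // silent since before the wait
        System.out.println(looksDead(5_000L, start, now, delay));  // heard from during the wait
    }
}
```

The check can only raise the probability that the node is really dead. A node
that is alive but partitioned from the replacing node would look the same.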
> convenience workflow for replacing dead node
> --------------------------------------------
>
> Key: CASSANDRA-957
> URL: https://issues.apache.org/jira/browse/CASSANDRA-957
> Project: Cassandra
> Issue Type: Wish
> Components: Core, Tools
> Affects Versions: 0.8.2
> Reporter: Jonathan Ellis
> Assignee: Vijay
> Fix For: 1.0
>
> Attachments: 0001-Support-Token-Replace.patch,
> 0001-Support-bringing-back-a-node-to-the-cluster-that-exi.patch,
> 0001-Support-token-replace.patch, 0001-support-for-replace-token-v3.patch,
> 0002-Do-not-include-local-node-when-computing-workMap.patch,
> 0002-Rework-Hints-to-be-on-token.patch,
> 0002-Rework-Hints-to-be-on-token.patch,
> 0002-upport-for-hints-on-token-v3.patch,
> 0003-Make-HintedHandoff-More-reliable.patch,
> 0003-Make-hints-More-reliable.patch, 0003-making-bootstrap-sleep-longer.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Replacing a dead node with a new one is a common operation, but "nodetool
> removetoken" followed by bootstrap is inefficient (re-replicating data first
> to the remaining nodes, then to the new one) and manually bootstrapping to a
> token "just less than" the old one's, followed by "nodetool removetoken" is
> slightly painful and prone to manual errors.
> First question: how would you expose this in our tool ecosystem? It needs to
> be a startup-time option to the new node, so it can't be nodetool, and
> messing with the config xml definitely takes the "convenience" out. A
> one-off -DreplaceToken=XXY argument?
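
As an illustration of the one-off startup argument proposed in the description,
a minimal sketch of reading such an option from JVM system properties might
look like the following. The property name "replaceToken" is taken from the
proposal above and may not match what the patch uses. The class and method
names are hypothetical, and the token is kept as an opaque string rather than
parsed.

```java
import java.util.Optional;
import java.util.Properties;

/** Hypothetical sketch (not part of Cassandra): reads a one-off replace-token startup option. */
final class ReplaceTokenOption {
    // Name taken from the proposal in the issue description; the committed name may differ.
    static final String PROPERTY = "replaceToken";

    private ReplaceTokenOption() {}

    /** Returns the token to replace, or empty if the option was not given (normal startup). */
    static Optional<String> fromProperties(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null) {
            return Optional.empty();
        }
        String token = raw.trim();
        if (token.isEmpty()) {
            throw new IllegalArgumentException("-D" + PROPERTY + " was given without a value");
        }
        return Optional.of(token);
    }

    public static void main(String[] args) {
        Optional<String> token = fromProperties(System.getProperties());
        System.out.println(token.map(t -> "replacing token " + t).orElse("normal startup"));
    }
}
```

Run with `java -DreplaceToken=XXY ReplaceTokenOption` to exercise the replace
path, or with no -D argument for the normal startup path.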
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira