I was looking over the Operations wiki, and with the many improvements in 
0.7, I wanted to bring up a thought. 

The two options today for replacing a node that has lost all its data are:

(Recommended approach) Bring up the replacement node with a new IP address and 
AutoBootstrap set to true in storage-conf.xml. This places the replacement 
node in the cluster, finds the appropriate position automatically, and then 
begins the bootstrap process. While bootstrap runs, the node will not receive 
reads. Once bootstrap finishes on the replacement node, run nodetool 
removetoken once, supplying the token of the dead node, and then nodetool 
cleanup on each node (see the command sketch after these two options).
(Alternative approach) Bring up a replacement node with the same IP and token 
as the old one, and run nodetool repair. Until the repair process is complete, 
clients reading only from this node may get no data back; using a higher 
ConsistencyLevel on reads avoids this.
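
Roughly, the two options look like this. This is just a sketch from memory: 
the AutoBootstrap element and nodetool flags are as I remember them from 
storage-conf.xml and the nodetool help, and the host/token values are 
placeholders.

    # Recommended approach: on the replacement node (new IP), before starting it,
    # in storage-conf.xml:
    <AutoBootstrap>true</AutoBootstrap>

    # After bootstrap finishes, from any live node:
    nodetool -h <live-node> ring                       # note the dead node's token
    nodetool -h <live-node> removetoken <dead-token>   # run once
    nodetool -h <each-node> cleanup                    # run on every node

    # Alternative approach: replacement node keeps the old IP and token, then:
    nodetool -h <replacement-node> repair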

For nodes that have lost a drive but keep the same IP address, what do you 
think about supplying the node's original token with AutoBootstrap set to 
true? This process works in trunk, but not all of the data seems to be 
streamed over from its replicas. It would give us the option of not letting a 
node take reads until the replicas have streamed the SSTables over, and would 
remove the need for the alternative approach of forcing a higher 
ConsistencyLevel. A rough config sketch of what I mean follows.
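
Concretely, I'm imagining something like this in storage-conf.xml on the 
rebuilt node (same IP as before the failure); this assumes InitialToken still 
accepts the dead node's token, and OLD_NODES_TOKEN is just a placeholder:

    <!-- storage-conf.xml on the rebuilt node, same IP as before the drive failure -->
    <AutoBootstrap>true</AutoBootstrap>
    <InitialToken>OLD_NODES_TOKEN</InitialToken>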

-Chris
