Hi all,
*My issues:
I have a few high-capacity servers, each with about 5 TB of disk space. From the 
wiki, I know there are 2 solutions for handling a node failure, but neither is 
very convenient for me:
Solution 1: new node + removetoken
Adding a new node causes a lot of data transfer between machines. And, of 
course, the removetoken operation causes data to be transferred too. That means 
the data is transferred between machines twice.

Solution 2: repair
Though I have not tried this operation yet, I find it is a heavy operation, 
because it triggers a major compaction. 

*My objective:
I think the most convenient way to handle one machine’s failure is:
"1) Replace the bad node with a new one. 2) Copy data from the other nodes to 
the new node. 3) Start the service on the new node."

*Why not the standard bootstrap?
At first, I thought bootstrap could deal with this. But I was wrong, 
because of 3 problems.
1) The new node, which has no data before bootstrapping, will be discovered by 
the other nodes as soon as its Gossiper is up. Before it switches to 
bootstrapping mode, it will receive many routed messages and, of course, it 
cannot respond to any of them.
2) After it switches to bootstrapping mode, the other nodes will remove it from 
the ring. That means the layout of the ring has changed while the 
replication count has not been restored. As a result, many read requests are 
routed to the wrong nodes.
3) The new node, which is in bootstrapping mode, thinks it is a brand-new node 
in the ring and that the ring's replication count has already been restored, so 
it calculates the replica nodes without itself in the ring. Of course, that’s 
not right.
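
To illustrate problem 2), here is a toy model in Python. It is not Cassandra's 
code: the four-node ring, the tokens, the node names and the replicas() helper 
are all made up, and it assumes the simple placement where a key's replicas are 
the first node with a token >= the key's token plus its clockwise successors.

```python
from bisect import bisect_left


def replicas(ring, key_token, rf):
    """Return the rf nodes responsible for key_token.

    ring maps token -> node name. The first replica is the node whose token
    is the smallest one >= key_token (wrapping around); the rest are its
    clockwise successors.
    """
    tokens = sorted(ring)
    start = bisect_left(tokens, key_token) % len(tokens)
    count = min(rf, len(tokens))
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


# Hypothetical ring; B is the node being replaced.
full_ring = {0: "A", 25: "B", 50: "C", 75: "D"}
# The ring as the other nodes see it once B has been removed from it.
ring_without_b = {t: n for t, n in full_ring.items() if n != "B"}

key = 10
before = replicas(full_ring, key, rf=2)       # ['B', 'C']
after = replicas(ring_without_b, key, rf=2)   # ['C', 'D']
```

In this model D never stored the key, so a read routed to D finds nothing even 
though C still has the data.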

*My solutions:
1) Hide the new node’s state from the other nodes until bootstrapping has 
finished.
Then the new node can stream the data without worrying that other nodes will 
route messages to it in the meantime, because they think it’s down.
It seems OK for the node to take itself out of the ‘deltaEpStateMap’ in the ACK 
and ACK2 messages. As a result, other nodes will not see it until it puts itself 
back.
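
A minimal sketch of what I mean, in Python rather than the real Java code. Only 
the name deltaEpStateMap comes from the Gossiper; the function, its parameters 
and the example addresses are hypothetical.

```python
def build_ack_delta(delta_ep_state_map, local_endpoint, bootstrapping):
    """Return the endpoint-state map to put into an outgoing ACK/ACK2.

    While the local node is bootstrapping, its own entry is left out, so
    peers never learn its state from these messages.
    """
    if not bootstrapping:
        return dict(delta_ep_state_map)
    return {ep: state for ep, state in delta_ep_state_map.items()
            if ep != local_endpoint}


# Hypothetical states; the local node is 10.0.0.2.
states = {"10.0.0.1": "state-A", "10.0.0.2": "state-B", "10.0.0.3": "state-C"}
hidden = build_ack_delta(states, "10.0.0.2", bootstrapping=True)
visible = build_ack_delta(states, "10.0.0.2", bootstrapping=False)
```

Here hidden carries only the entries for 10.0.0.1 and 10.0.0.3, while visible 
carries all three.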

2) Add itself to the ring temporarily while calculating the replica 
nodes.
In this way, the new node will see the original ring and can find the right 
replica nodes.
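
The calculation could look roughly like this toy Python sketch. It is not 
Cassandra's code: ranges_to_fetch(), the ring, the tokens and the node names 
are hypothetical, and it assumes replicas are simply the owner of a range plus 
its clockwise successors.

```python
def ranges_to_fetch(live_ring, my_token, my_name, rf):
    """Map each range the new node must hold to the nodes it can stream from.

    live_ring maps token -> node name and does not contain the new node.
    The new node is added to a local copy only, so the shared view of the
    ring is left untouched.
    """
    ring = dict(live_ring)
    ring[my_token] = my_name  # temporary, local-only view of the original ring
    tokens = sorted(ring)
    n = len(tokens)
    plan = {}
    for i, tok in enumerate(tokens):
        # The range (previous token, tok] lives on the node at tok and on
        # its rf - 1 clockwise successors.
        owners = [ring[tokens[(i + k) % n]] for k in range(min(rf, n))]
        if my_name in owners:
            prev = tokens[i - 1]  # index -1 wraps around for the first token
            plan[(prev, tok)] = [o for o in owners if o != my_name]
    return plan


# Hypothetical ring after B died; the replacement takes over B's token 25.
live = {0: "A", 50: "C", 75: "D"}
plan = ranges_to_fetch(live, my_token=25, my_name="B", rf=2)
# plan == {(75, 0): ['A'], (0, 25): ['C']}
```

With rf=2 the replacement fetches its own range (0, 25] from C and the 
replicated range (75, 0] from A.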

*My questions:
Am I right? Are there any problems with my design? Why has this not been done 
before?
                                
--------------
XL.Pan
2010-01-14
