I performed the test again. The test environment is similar to before:

Address       Status   Load        Range                                      Ring
                                   170141183460469231731687303715884105728
10.237.4.85   Up       378.41 MB   21267647932558653966460912964485513216    |<--|
10.237.1.135  Up       377.04 MB   42535295865117307932921825928971026432    |   ^
10.237.1.137  Up       378.21 MB   63802943797675961899382738893456539648    v   |
10.237.1.139  Up       372.93 MB   85070591730234615865843651857942052864    |   ^
10.237.1.140  Up       371.95 MB   106338239662793269832304564822427566080   v   |
10.237.1.141  Up       366.18 MB   127605887595351923798765477786913079296   |   ^
10.237.1.143  Up       364.12 MB   148873535527910577765226390751398592512   v   |
10.237.1.144  Up       370.39 MB   170141183460469231731687303715884105728   |-->|

Perform the following test:

1. Kill the service on 10.237.1.135 and clean up all data on that node
   (remove the whole data directory, not just a single table).
2. Wait some time until the other nodes have noticed that 10.237.1.135 is down.
3. Restart the service on every node except 10.237.1.135 (10.237.1.135 stays
   down). <-- THIS IS THE DIFFERENCE FROM MY PREVIOUS TEST
4. Re-configure 10.237.1.135:
   ....
   <AutoBootstrap>true</AutoBootstrap>
   ....
   <InitialToken>42535295865117307932921825928971026432</InitialToken>
   ....
5. Start the service on 10.237.1.135.
6. Wait some time and check the system.log of 10.237.1.135: it did indeed
   perform a bootstrap, but no data was transferred.
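For reference, the InitialToken set in step 4 is simply 10.237.1.135's original token; the tokens in the ring above are evenly spaced over the RandomPartitioner token space (0 to 2**127). A minimal sketch (my own illustration, not part of the test) that reproduces them:

    # Reproduce the evenly spaced tokens of the 8-node ring above
    # (RandomPartitioner token space is 0 .. 2**127).
    NUM_NODES = 8
    TOKEN_SPACE = 2 ** 127

    for i in range(1, NUM_NODES + 1):
        print(TOKEN_SPACE * i // NUM_NODES)

    # i = 2 prints 42535295865117307932921825928971026432, the
    # InitialToken re-assigned to 10.237.1.135 in step 4.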
In step 3, after restarting all services (except 10.237.1.135), the cluster should have no information about the existence of 10.237.1.135, so when 10.237.1.135 is restarted in step 5 it should bootstrap and pull its data from the other nodes. But it does not work as expected.

---------END----------

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Monday, January 04, 2010 9:42 AM
To: cassandra-user@incubator.apache.org
Subject: Re: bug in bootstraping??

... what should also work is bootstrapping the new node in (with a
different IP) FIRST, in between the old node's token and its
successor's. Then part of the range's data will be transferred on
bootstrap, and the rest when you decommission the old one afterwards.

-Jonathan

On Sun, Jan 3, 2010 at 7:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> This is working as designed; to use the bootstrap approach you must
> removetoken the old entry first. This is not necessary with the
> "nodeprobe repair" approach to recovery. I will edit the wiki to make
> this more clear.
>
> On Sun, Jan 3, 2010 at 3:11 AM, Michael Lee
> <mail.list.steel.men...@gmail.com> wrote:
>> Hi, guys:
>>
>> If one node of a cluster goes down and its data is damaged, it can
>> theoretically restore that data by bootstrapping (wiki link Operations).
>>
>> But sometimes it loses some or all of its original data.
>>
>> Suppose an 8-node cluster holding 10000 rows of about 100 KB each, with
>> ReplicationFactor 3:
>>
>> Address       Status   Load        Range                                      Ring
>>                                    170141183460469231731687303715884105728
>> 10.237.4.85   Up       378.41 MB   21267647932558653966460912964485513216    |<--|
>> 10.237.1.135  Up       377.04 MB   42535295865117307932921825928971026432    |   ^
>> 10.237.1.137  Up       378.21 MB   63802943797675961899382738893456539648    v   |
>> 10.237.1.139  Up       372.93 MB   85070591730234615865843651857942052864    |   ^
>> 10.237.1.140  Up       371.95 MB   106338239662793269832304564822427566080   v   |
>> 10.237.1.141  Up       366.18 MB   127605887595351923798765477786913079296   |   ^
>> 10.237.1.143  Up       364.12 MB   148873535527910577765226390751398592512   v   |
>> 10.237.1.144  Up       370.39 MB   170141183460469231731687303715884105728   |-->|
>>
>> Perform the following test:
>>
>> 1. Kill the service on 10.237.1.135, clean up all data on that node
>>    (remove the whole data directory, not just a single table).
>> 2. Re-configure 10.237.1.135:
>>    ....
>>    <InitialToken>42535295865117307932921825928971026432</InitialToken>
>>    ....
>>    <AutoBootstrap>true</AutoBootstrap>
>> 3. Start the service on 10.237.1.135.
>> 4. Wait a very long time, then check what happens:
>>
>> Address       Status   Load        Range                                      Ring
>>                                    170141183460469231731687303715884105728
>> 10.237.4.85   Up       378.41 MB   21267647932558653966460912964485513216    |<--|  /// it's the seed; my cluster has only one seed
>> 10.237.1.135  Up       0 bytes     42535295865117307932921825928971026432    |   ^  /// lost all data
>> 10.237.1.137  Up       378.21 MB   63802943797675961899382738893456539648    v   |
>> 10.237.1.139  Up       372.93 MB   85070591730234615865843651857942052864    |   ^
>> 10.237.1.140  Up       371.95 MB   106338239662793269832304564822427566080   v   |
>> 10.237.1.141  Up       366.18 MB   127605887595351923798765477786913079296   |   ^
>> 10.237.1.143  Up       364.12 MB   148873535527910577765226390751398592512   v   |
>> 10.237.1.144  Up       370.39 MB   170141183460469231731687303715884105728   |-->|
>>
>> Checking the system.log of 10.237.1.135, we can see that 10.237.1.135
>> did indeed do some bootstrapping.
>>
>> If I repeat the above test with any other node except 10.237.1.135 (and
>> 10.237.4.85 of course, since it is the seed and a seed cannot bootstrap),
>> some nodes can restore about 120~200 MB of data by bootstrapping, while
>> some nodes restore nothing.
>>
>> I know 'removetoken' can fix the replicas, but if I removetoken first and
>> then bring the node back, some data will be moved twice, which is a waste
>> of network bandwidth.
>>
>> So the question is: is this "random bootstrapping" behavior a bug, or is
>> it by design?
>>
>> ---------END----------
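As a footnote (this is my own simplified illustration, not code from Cassandra): with ReplicationFactor 3 and the default rack-unaware placement, each range is stored on the node that owns it plus the next two nodes clockwise, so after the wipe the data for 10.237.1.135's range should still be fully present on 10.237.1.137 and 10.237.1.139, and that is where a successful bootstrap would be expected to stream it from. A rough Python sketch of that placement model:

    from bisect import bisect_left

    # Simplified model of RF=3 rack-unaware replica placement on the ring
    # shown earlier (illustration only, not Cassandra's actual code).
    RING = [
        (21267647932558653966460912964485513216,  "10.237.4.85"),
        (42535295865117307932921825928971026432,  "10.237.1.135"),
        (63802943797675961899382738893456539648,  "10.237.1.137"),
        (85070591730234615865843651857942052864,  "10.237.1.139"),
        (106338239662793269832304564822427566080, "10.237.1.140"),
        (127605887595351923798765477786913079296, "10.237.1.141"),
        (148873535527910577765226390751398592512, "10.237.1.143"),
        (170141183460469231731687303715884105728, "10.237.1.144"),
    ]
    RF = 3

    def replicas_for(key_token):
        """Nodes that should hold a key with this token: the range's owner
        plus the next RF-1 nodes clockwise around the ring."""
        tokens = [t for t, _ in RING]
        owner = bisect_left(tokens, key_token) % len(RING)  # wrap around
        return [RING[(owner + k) % len(RING)][1] for k in range(RF)]

    # A key in the range (21267..., 42535...] owned by 10.237.1.135:
    print(replicas_for(30000000000000000000000000000000000000))
    # -> ['10.237.1.135', '10.237.1.137', '10.237.1.139']

Under this model it is also easy to see why doing removetoken first and bootstrapping afterwards moves some data twice, as noted above: removing the token shifts the affected replica ranges onto the next nodes clockwise, and re-inserting the same token shifts them back.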