Does http://issues.apache.org/jira/browse/CASSANDRA-501 help with 0.4.1?
I haven't done much testing, but it seemed to fix the problem for a simple 1 node cluster --> 2 node cluster test.

On Thu, Oct 29, 2009 at 4:42 PM, Edmond Lau <[email protected]> wrote:
> I'm not able to bootstrap a new node on either 0.4.1 or trunk. I
> started up a simple 2 node cluster with a replication factor of 2 and
> then bootstrapped a 3rd (using -b in 0.4.1 and AutoBootstrap in
> trunk).
>
> In 0.4.1, I do observe some writes going to the new node as expected,
> but then the BOOT-STRAPPER thread throws an NPE and the node never
> shows up in nodeprobe ring. I believe this is fixed in CASSANDRA-425:
>
> DEBUG [BOOT-STRAPPER:1] 2009-10-29 22:56:41,272 BootStrapper.java (line 100) Total number of old ranges 2
> DEBUG [BOOT-STRAPPER:1] 2009-10-29 22:56:41,274 BootStrapper.java (line 83) Exception was generated at : 10/29/2009 22:56:41 on thread BOOT-STRAPPER:1
>
> java.lang.NullPointerException
>     at org.apache.cassandra.dht.Range.contains(Range.java:105)
>     at org.apache.cassandra.dht.LeaveJoinProtocolHelper.getRangeSplitRangeMapping(LeaveJoinProtocolHelper.java:72)
>     at org.apache.cassandra.dht.BootStrapper.getRangesWithSourceTarget(BootStrapper.java:105)
>     at org.apache.cassandra.dht.BootStrapper.run(BootStrapper.java:73)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
>
> On trunk, the 3rd node never receives any writes and just sits there
> doing nothing. It also never shows up on the nodeprobe ring:
>
> INFO [main] 2009-10-29 23:15:24,934 StorageService.java (line 264) Starting in bootstrap mode (first, sleeping to get load information)
> INFO [GMFD:1] 2009-10-29 23:15:26,423 Gossiper.java (line 634) Node /172.16.130.130 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:15:26,424 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.130 - has token 129730098012431089662630620415811546756
> INFO [GMFD:1] 2009-10-29 23:15:26,426 Gossiper.java (line 634) Node /172.16.130.129 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:15:26,426 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.129 - has token 30741330848943310678704865619376516001
> DEBUG [Timer-0] 2009-10-29 23:15:26,930 LoadDisseminator.java (line 39) Disseminating load info ...
> DEBUG [GMFD:1] 2009-10-29 23:18:39,451 StorageService.java (line 434) InetAddress /172.16.130.130 just recovered from a partition. Sending hinted data.
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,454 HintedHandOffManager.java (line 186) Started hinted handoff for endPoint /172.16.130.130
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,456 HintedHandOffManager.java (line 225) Finished hinted handoff for endpoint /172.16.130.130
> DEBUG [GMFD:1] 2009-10-29 23:18:39,954 StorageService.java (line 434) InetAddress /172.16.130.129 just recovered from a partition. Sending hinted data.
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,955 HintedHandOffManager.java (line 186) Started hinted handoff for endPoint /172.16.130.129
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,956 HintedHandOffManager.java (line 225) Finished hinted handoff for endpoint /172.16.130.129
>
> Bootstrapping the 3rd node after manually giving it an initial token
> led to an AssertionError:
>
> INFO [main] 2009-10-29 23:25:11,720 SystemTable.java (line 125) Saved Token not found. Using 0
> DEBUG [main] 2009-10-29 23:25:11,878 MessagingService.java (line 203) Starting to listen on v31.vv.prod.ooyala.com/172.16.130.131
> INFO [main] 2009-10-29 23:25:11,933 StorageService.java (line 264) Starting in bootstrap mode (first, sleeping to get load information)
> INFO [GMFD:1] 2009-10-29 23:25:13,679 Gossiper.java (line 634) Node /172.16.130.130 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:25:13,680 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.130 - has token 50846833567878089067494666696176925951
> INFO [GMFD:1] 2009-10-29 23:25:13,682 Gossiper.java (line 634) Node /172.16.130.129 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:25:13,682 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.129 - has token 44233547425983959380881840716972243602
> DEBUG [Timer-0] 2009-10-29 23:25:13,929 LoadDisseminator.java (line 39) Disseminating load info ...
> ERROR [main] 2009-10-29 23:25:43,754 CassandraDaemon.java (line 184) Exception encountered during startup.
> java.lang.AssertionError
>     at org.apache.cassandra.dht.BootStrapper.<init>(BootStrapper.java:84)
>     at org.apache.cassandra.service.StorageService.start(StorageService.java:267)
>     at org.apache.cassandra.service.CassandraServer.start(CassandraServer.java:72)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:94)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:166)
>
> Thoughts?
>
> Edmond
>
> On Wed, Oct 28, 2009 at 2:24 PM, Jonathan Ellis <[email protected]> wrote:
>> On Wed, Oct 28, 2009 at 1:15 PM, Edmond Lau <[email protected]> wrote:
>>> Sounds reasonable. Until CASSANDRA-435 is complete, there's no way
>>> currently to take down a node and have it be removed from the list of
>>> nodes that are responsible for the data in its token range, correct?
>>> All other nodes will just assume that it's temporarily unavailable?
>>
>> Right.
>>
>>> Assume that we had the ability to permanently remove a node. Would
>>> modifying the token on an existing node and restarting it with
>>> bootstrapping somehow be incorrect, or merely not performant b/c we'll
>>> be performing lazy repair on most reads until the node is up to date?
>>
>> If you permanently remove a node, wipe its data directory, and restart
>> it, it's effectively a new node, so everything works fine. If you
>> don't wipe its data directory it won't bootstrap (and it will ignore a
>> new token in the configuration file in favor of the one it stored in
>> the system table) since it will say "hey, I must have crashed and
>> restarted. Here I am again guys!"
>>
>> Bootstrap is for new nodes. Don't try to be too clever. :)
>>
>>> if I wanted to
>>> migrate my cluster to a completely new set of machines. I would then
>>> bootstrap all the new nodes in the new cluster, and then decommission
>>> my old nodes one by one (assuming
>>> https://issues.apache.org/jira/browse/CASSANDRA-435 was done). After
>>> the migration, all my nodes would've been bootstrapped.
>>
>> Sure.
>>
>> -Jonathan
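
To make the restart-vs-bootstrap behaviour Jonathan describes a bit more concrete, here is a minimal, self-contained Java sketch of that decision. It is not the actual Cassandra source; the names (chooseStartupMode, Decision, and so on) are made up for illustration. The only grounded facts are the ones stated in the thread: a token already saved in the system table wins over the configuration file and suppresses bootstrap, and a node with no saved token logs "Saved Token not found. Using 0".

    // Sketch only: NOT the Cassandra implementation. Illustrates the rule
    // "saved token wins, and a node with a saved token never bootstraps".
    import java.math.BigInteger;
    import java.util.Optional;

    public class BootstrapDecisionSketch
    {
        enum Mode { NORMAL_RESTART, BOOTSTRAP }

        static final class Decision
        {
            final Mode mode;
            final BigInteger token;
            Decision(Mode mode, BigInteger token) { this.mode = mode; this.token = token; }
            public String toString() { return mode + " with token " + token; }
        }

        static Decision chooseStartupMode(Optional<BigInteger> savedToken,
                                          Optional<BigInteger> configuredToken,
                                          boolean autoBootstrap)
        {
            if (savedToken.isPresent())
            {
                // Data directory intact: "I must have crashed and restarted."
                // Reuse the saved token, ignore the config, do not bootstrap.
                return new Decision(Mode.NORMAL_RESTART, savedToken.get());
            }
            // Genuinely new node: no saved token, so take the configured one
            // (falling back to 0, echoing the "Saved Token not found. Using 0"
            // log line above) and bootstrap if that is enabled.
            BigInteger token = configuredToken.orElse(BigInteger.ZERO);
            return new Decision(autoBootstrap ? Mode.BOOTSTRAP : Mode.NORMAL_RESTART, token);
        }

        public static void main(String[] args)
        {
            // A restarted node keeps its old token even if the config now says 99:
            System.out.println(chooseStartupMode(Optional.of(BigInteger.valueOf(42)),
                                                 Optional.of(BigInteger.valueOf(99)),
                                                 true));  // NORMAL_RESTART with token 42

            // A wiped or brand-new node with AutoBootstrap enabled actually bootstraps:
            System.out.println(chooseStartupMode(Optional.empty(),
                                                 Optional.of(BigInteger.valueOf(99)),
                                                 true));  // BOOTSTRAP with token 99
        }
    }

This is also why "wipe the data directory" and "it's effectively a new node" go together in the advice above: removing the saved token is what puts the node back on the bootstrap path.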

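A toy illustration of why the 0.4.1 failure surfaces the way it does may also help. Edmond believes the real bug is the one fixed in CASSANDRA-425; the snippet below only shows, with simplified stand-in classes (not the real org.apache.cassandra.dht ones), how a null bound left on one of the old ranges being split for the joining node would surface as an NPE at the containment check, matching the trace at Range.contains(Range.java:105).

    // Toy stand-ins only; the real Range also handles ring wrap-around.
    import java.math.BigInteger;

    public class RangeContainsNpeSketch
    {
        static final class Range
        {
            final BigInteger left;   // exclusive
            final BigInteger right;  // inclusive
            Range(BigInteger left, BigInteger right) { this.left = left; this.right = right; }

            boolean contains(BigInteger token)
            {
                // A null bound (or null token) blows up on compareTo, which is
                // the shape of the NPE in the 0.4.1 bootstrap trace above.
                return left.compareTo(token) < 0 && token.compareTo(right) <= 0;
            }
        }

        public static void main(String[] args)
        {
            Range broken = new Range(BigInteger.ONE, null); // right bound never set
            broken.contains(BigInteger.TEN);                // throws NullPointerException
        }
    }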