Does http://issues.apache.org/jira/browse/CASSANDRA-501 help with 0.4.1?
I haven't done much testing, but it seemed to fix the problem for a simple 1 node cluster --> 2 node cluster test.

On Thu, Oct 29, 2009 at 4:42 PM, Edmond Lau <[email protected]> wrote:
> I'm not able to bootstrap a new node on either 0.4.1 or trunk. I
> started up a simple 2 node cluster with a replication factor of 2 and
> then bootstrapped a 3rd (using -b in 0.4.1 and AutoBootstrap in
> trunk).
>
> In 0.4.1, I do observe some writes going to the new node as expected,
> but then the BOOT-STRAPPER thread throws an NPE and the node never
> shows up in nodeprobe ring. I believe this is fixed in CASSANDRA-425:
>
> DEBUG [BOOT-STRAPPER:1] 2009-10-29 22:56:41,272 BootStrapper.java (line 100) Total number of old ranges 2
> DEBUG [BOOT-STRAPPER:1] 2009-10-29 22:56:41,274 BootStrapper.java (line 83) Exception was generated at : 10/29/2009 22:56:41 on thread BOOT-STRAPPER:1
>
> java.lang.NullPointerException
>     at org.apache.cassandra.dht.Range.contains(Range.java:105)
>     at org.apache.cassandra.dht.LeaveJoinProtocolHelper.getRangeSplitRangeMapping(LeaveJoinProtocolHelper.java:72)
>     at org.apache.cassandra.dht.BootStrapper.getRangesWithSourceTarget(BootStrapper.java:105)
>     at org.apache.cassandra.dht.BootStrapper.run(BootStrapper.java:73)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
>
> On trunk, the 3rd node never receives any writes and just sits there
> doing nothing. It also never shows up on the nodeprobe ring:
>
> INFO [main] 2009-10-29 23:15:24,934 StorageService.java (line 264) Starting in bootstrap mode (first, sleeping to get load information)
> INFO [GMFD:1] 2009-10-29 23:15:26,423 Gossiper.java (line 634) Node /172.16.130.130 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:15:26,424 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.130 - has token 129730098012431089662630620415811546756
> INFO [GMFD:1] 2009-10-29 23:15:26,426 Gossiper.java (line 634) Node /172.16.130.129 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:15:26,426 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.129 - has token 30741330848943310678704865619376516001
> DEBUG [Timer-0] 2009-10-29 23:15:26,930 LoadDisseminator.java (line 39) Disseminating load info ...
> DEBUG [GMFD:1] 2009-10-29 23:18:39,451 StorageService.java (line 434) InetAddress /172.16.130.130 just recovered from a partition. Sending hinted data.
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,454 HintedHandOffManager.java (line 186) Started hinted handoff for endPoint /172.16.130.130
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,456 HintedHandOffManager.java (line 225) Finished hinted handoff for endpoint /172.16.130.130
> DEBUG [GMFD:1] 2009-10-29 23:18:39,954 StorageService.java (line 434) InetAddress /172.16.130.129 just recovered from a partition. Sending hinted data.
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,955 HintedHandOffManager.java (line 186) Started hinted handoff for endPoint /172.16.130.129
> DEBUG [HINTED-HANDOFF-POOL:1] 2009-10-29 23:18:39,956 HintedHandOffManager.java (line 225) Finished hinted handoff for endpoint /172.16.130.129
>
> Bootstrapping the 3rd node after manually giving it an initial token
> led to an AssertionError:
>
> INFO [main] 2009-10-29 23:25:11,720 SystemTable.java (line 125) Saved Token not found. Using 0
> DEBUG [main] 2009-10-29 23:25:11,878 MessagingService.java (line 203) Starting to listen on v31.vv.prod.ooyala.com/172.16.130.131
> INFO [main] 2009-10-29 23:25:11,933 StorageService.java (line 264) Starting in bootstrap mode (first, sleeping to get load information)
> INFO [GMFD:1] 2009-10-29 23:25:13,679 Gossiper.java (line 634) Node /172.16.130.130 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:25:13,680 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.130 - has token 50846833567878089067494666696176925951
> INFO [GMFD:1] 2009-10-29 23:25:13,682 Gossiper.java (line 634) Node /172.16.130.129 has now joined.
> DEBUG [GMFD:1] 2009-10-29 23:25:13,682 StorageService.java (line 389) CHANGE IN STATE FOR /172.16.130.129 - has token 44233547425983959380881840716972243602
> DEBUG [Timer-0] 2009-10-29 23:25:13,929 LoadDisseminator.java (line 39) Disseminating load info ...
> ERROR [main] 2009-10-29 23:25:43,754 CassandraDaemon.java (line 184) Exception encountered during startup.
> java.lang.AssertionError
>     at org.apache.cassandra.dht.BootStrapper.<init>(BootStrapper.java:84)
>     at org.apache.cassandra.service.StorageService.start(StorageService.java:267)
>     at org.apache.cassandra.service.CassandraServer.start(CassandraServer.java:72)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:94)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:166)
>
> Thoughts?
>
> Edmond
>
> On Wed, Oct 28, 2009 at 2:24 PM, Jonathan Ellis <[email protected]> wrote:
>> On Wed, Oct 28, 2009 at 1:15 PM, Edmond Lau <[email protected]> wrote:
>>> Sounds reasonable. Until CASSANDRA-435 is complete, there's no way
>>> currently to take down a node and have it be removed from the list of
>>> nodes that are responsible for the data in its token range, correct?
>>> All other nodes will just assume that it's temporarily unavailable?
>>
>> Right.
>>
>>> Assume that we had the ability to permanently remove a node. Would
>>> modifying the token on an existing node and restarting it with
>>> bootstrapping somehow be incorrect, or merely not performant b/c we'll
>>> be performing lazy repair on most reads until the node is up to date?
>>
>> If you permanently remove a node, wipe its data directory, and restart
>> it, it's effectively a new node, so everything works fine. If you
>> don't wipe its data directory it won't bootstrap (and it will ignore a
>> new token in the configuration file in favor of the one it stored in
>> the system table) since it will say "hey, I must have crashed and
>> restarted. Here I am again guys!"
>>
>> Bootstrap is for new nodes. Don't try to be too clever. :)
>>
>>> if I wanted to
>>> migrate my cluster to a completely new set of machines. I would then
>>> bootstrap all the new nodes in the new cluster, and then decommission
>>> my old nodes one by one (assuming
>>> https://issues.apache.org/jira/browse/CASSANDRA-435 was done). After
>>> the migration, all my nodes would've been bootstrapped.
>>
>> Sure.
>>
>> -Jonathan
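
To make the restart-vs-bootstrap behaviour Jonathan describes a bit more concrete, here is a minimal, self-contained Java sketch of that decision. It is not the actual Cassandra source; the names (chooseStartupMode, Decision, and so on) are made up for illustration. The only grounded facts are the ones stated in the thread: a token already saved in the system table wins over the configuration file and suppresses bootstrap, and a node with no saved token logs "Saved Token not found. Using 0".

    // Sketch only: NOT the Cassandra implementation. Illustrates the rule
    // "saved token wins, and a node with a saved token never bootstraps".
    import java.math.BigInteger;
    import java.util.Optional;

    public class BootstrapDecisionSketch
    {
        enum Mode { NORMAL_RESTART, BOOTSTRAP }

        static final class Decision
        {
            final Mode mode;
            final BigInteger token;
            Decision(Mode mode, BigInteger token) { this.mode = mode; this.token = token; }
            public String toString() { return mode + " with token " + token; }
        }

        static Decision chooseStartupMode(Optional<BigInteger> savedToken,
                                          Optional<BigInteger> configuredToken,
                                          boolean autoBootstrap)
        {
            if (savedToken.isPresent())
            {
                // Data directory intact: "I must have crashed and restarted."
                // Reuse the saved token, ignore the config, do not bootstrap.
                return new Decision(Mode.NORMAL_RESTART, savedToken.get());
            }
            // Genuinely new node: no saved token, so take the configured one
            // (falling back to 0, echoing the "Saved Token not found. Using 0"
            // log line above) and bootstrap if that is enabled.
            BigInteger token = configuredToken.orElse(BigInteger.ZERO);
            return new Decision(autoBootstrap ? Mode.BOOTSTRAP : Mode.NORMAL_RESTART, token);
        }

        public static void main(String[] args)
        {
            // A restarted node keeps its old token even if the config now says 99:
            System.out.println(chooseStartupMode(Optional.of(BigInteger.valueOf(42)),
                                                 Optional.of(BigInteger.valueOf(99)),
                                                 true));  // NORMAL_RESTART with token 42

            // A wiped or brand-new node with AutoBootstrap enabled actually bootstraps:
            System.out.println(chooseStartupMode(Optional.empty(),
                                                 Optional.of(BigInteger.valueOf(99)),
                                                 true));  // BOOTSTRAP with token 99
        }
    }

This is also why "wipe the data directory" and "it's effectively a new node" go together in the advice above: removing the saved token is what puts the node back on the bootstrap path.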

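A toy illustration of why the 0.4.1 failure surfaces the way it does may also help. Edmond believes the real bug is the one fixed in CASSANDRA-425; the snippet below only shows, with simplified stand-in classes (not the real org.apache.cassandra.dht ones), how a null bound left on one of the old ranges being split for the joining node would surface as an NPE at the containment check, matching the trace at Range.contains(Range.java:105).

    // Toy stand-ins only; the real Range also handles ring wrap-around.
    import java.math.BigInteger;

    public class RangeContainsNpeSketch
    {
        static final class Range
        {
            final BigInteger left;   // exclusive
            final BigInteger right;  // inclusive
            Range(BigInteger left, BigInteger right) { this.left = left; this.right = right; }

            boolean contains(BigInteger token)
            {
                // A null bound (or null token) blows up on compareTo, which is
                // the shape of the NPE in the 0.4.1 bootstrap trace above.
                return left.compareTo(token) < 0 && token.compareTo(right) <= 0;
            }
        }

        public static void main(String[] args)
        {
            Range broken = new Range(BigInteger.ONE, null); // right bound never set
            broken.contains(BigInteger.TEN);                // throws NullPointerException
        }
    }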