Re: Bootstrapping taking long

Ran Tavory Tue, 04 Jan 2011 11:27:15 -0800

Thanks Jake, but unfortunately the streams directory is empty, so I don't
think that any of the nodes is anti-compacting data right now or has been in
the past 5 hours.
It seems that all the data was already transferred to the joining host, but
the joining node, after having received the data, still remains in
bootstrapping mode and does not join the cluster. I'm not sure that *all* data
was transferred (perhaps other nodes need to transfer more data) but nothing
is actually happening so I assume all has been moved.
Perhaps it's a configuration error on my part. Should I use
AutoBootstrap=true? Anything else I should look out for in the
configuration file or something else?
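
For anyone who wants to script the two checks discussed in this thread (is the
"streams" subdirectory empty, and does `nodetool streams` report an idle node
that is still in Bootstrapping mode), here is a minimal sketch in Python. It
only understands the three output lines quoted further down in this thread; the
format of lines printed while streams are active is not shown here, so those
cases are left as "unknown". The function names are hypothetical, and the
directory layout (a "streams" folder directly under the keyspace data dir) is
an assumption taken from Jake's description for 0.6.

```python
import os


def parse_streams_output(text):
    """Parse `nodetool streams` text as quoted in this thread (0.6.x).

    Only these lines are recognised:
      Mode: <mode>
      Not sending any streams.
      Not receiving any streams.
    Anything else is ignored, so 'sending'/'receiving' stay None (unknown).
    """
    status = {"mode": None, "sending": None, "receiving": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Mode:"):
            status["mode"] = line.split(":", 1)[1].strip()
        elif line.startswith("Not sending any streams"):
            status["sending"] = False
        elif line.startswith("Not receiving any streams"):
            status["receiving"] = False
    return status


def looks_idle_while_bootstrapping(status):
    """True when the node reports Bootstrapping but no stream activity."""
    return (
        status["mode"] == "Bootstrapping"
        and status["sending"] is False
        and status["receiving"] is False
    )


def list_stream_files(keyspace_data_dir):
    """List files in <keyspace_data_dir>/streams; empty list if it is absent.

    Assumption: the anticompacting node writes the SSTables destined for the
    bootstrapping node into this subdirectory, as described for 0.6.
    """
    streams_dir = os.path.join(keyspace_data_dir, "streams")
    if not os.path.isdir(streams_dir):
        return []
    return sorted(os.listdir(streams_dir))


if __name__ == "__main__":
    sample = (
        "Mode: Bootstrapping\n"
        "Not sending any streams.\n"
        "Not receiving any streams.\n"
    )
    status = parse_streams_output(sample)
    print(status)
    print("idle while bootstrapping:", looks_idle_while_bootstrapping(status))
```

To use it against a live node, feed it the captured output of
`bin/nodetool -p 9004 -h localhost streams` and point `list_stream_files` at the
keyspace data directory on each node that might be anticompacting. An idle
result together with empty streams directories only confirms the symptom
described above; it does not say why the node has not joined the ring.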



On Tue, Jan 4, 2011 at 4:08 PM, Jake Luciani <jak...@gmail.com> wrote:

> In 0.6, locate the node doing anti-compaction and look in the "streams"
> subdirectory in the keyspace data dir to monitor the anti-compaction
> progress (it puts new SSTables for bootstrapping node in there)
>
>
> On Tue, Jan 4, 2011 at 8:01 AM, Ran Tavory <ran...@gmail.com> wrote:
>
>> Running nodetool decommission didn't help. Actually the node refused to
>> decommission itself (b/c it wasn't part of the ring). So I simply stopped
>> the process, deleted all the data directories and started it again. It
>> worked in the sense of the node bootstrapped again but as before, after it
>> had finished moving the data nothing happened for a long time (I'm still
>> waiting, but nothing seems to be happening).
>>
>> Any hints how to analyze a "stuck" bootstrapping node??
>> thanks
>>
>> On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory <ran...@gmail.com> wrote:
>>
>>> Thanks Shimi, so indeed anticompaction was run on one of the other nodes
>>> from the same DC but to my understanding it has already ended. A few hours
>>> ago...
>>> I saw plenty of log messages such as [1] which ended a couple of hours ago,
>>> and I've seen the new node streaming and accepting the data from the node
>>> which performed the anticompaction and so far it was normal so it seemed
>>> that data is at its right place. But now the new node seems sort of stuck.
>>> None of the other nodes is anticompacting right now or had been
>>> anticompacting since then.
>>> The new node's CPU is close to zero, its iostats are almost zero so I
>>> can't find another bottleneck that would keep it hanging.
>>>
>>> On the IRC someone suggested I'd maybe retry to join this node,
>>> e.g. decommission and rejoin it again. I'll try it now...
>>>
>>>
>>> [1]
>>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java
>>> (line 338) AntiCompacting
>>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:18,683 CompactionManager.java
>>> (line 338) AntiCompacting
>>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3874-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3873-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3876-Data.db')]
>>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:19,132 CompactionManager.java
>>> (line 338) AntiCompacting
>>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-951-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-976-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-978-Data.db')]
>>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:26,486 CompactionManager.java
>>> (line 338) AntiCompacting
>>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>>
>>> On Tue, Jan 4, 2011 at 12:45 PM, shimi <shim...@gmail.com> wrote:
>>>
>>>> In my experience most of the time it takes for a node to join the
>>>> cluster is the anticompaction on the other nodes. The streaming part is
>>>> very fast.
>>>> Check the other nodes' logs to see if there is any node doing
>>>> anticompaction.
>>>> I don't remember how much data I had in the cluster when I needed to
>>>> add/remove nodes. I do remember that it took a few hours.
>>>>
>>>> The node will join the ring only when it finishes the bootstrap.
>>>>
>>>> Shimi
>>>>
>>>>
>>>> On Tue, Jan 4, 2011 at 12:28 PM, Ran Tavory <ran...@gmail.com> wrote:
>>>>
>>>>> I asked the same question on the IRC but no luck there, everyone's
>>>>> asleep ;)...
>>>>>
>>>>> Using 0.6.6 I'm adding a new node to the cluster.
>>>>> It starts out fine but then gets stuck in the bootstrapping state for
>>>>> too long. More than an hour and still counting.
>>>>>
>>>>> $ bin/nodetool -p 9004 -h localhost streams
>>>>>> Mode: Bootstrapping
>>>>>> Not sending any streams.
>>>>>> Not receiving any streams.
>>>>>
>>>>>
>>>>> It seemed to have streamed data from other nodes and indeed the load is
>>>>> non-zero but I'm not clear what's keeping it right now from finishing.
>>>>>
>>>>>> $ bin/nodetool -p 9004 -h localhost info
>>>>>> 51042355038140769519506191114765231716
>>>>>> Load             : 22.49 GB
>>>>>> Generation No    : 1294133781
>>>>>> Uptime (seconds) : 1795
>>>>>> Heap Memory (MB) : 315.31 / 6117.00
>>>>>
>>>>>
>>>>> nodetool ring does not list this new node in the ring, although
>>>>> nodetool can happily talk to the new node, it's just not listing itself
>>>>> as a member of the ring. This is expected when the node is still
>>>>> bootstrapping, so the question is still how long the bootstrap might
>>>>> take and whether it is stuck.
>>>>>
>>>>> The data isn't huge so I find it hard to believe that streaming or
>>>>> anticompaction are the bottlenecks. I have ~20G on each node and the new node
>>>>> already has just about that so it seems that all data had already been
>>>>> streamed to it successfully, or at least most of the data... So what is it
>>>>> waiting for now? (same question, rephrased... ;)
>>>>>
>>>>> I tried:
>>>>> 1. Restarting the new node. No good. All logs seem normal but at the
>>>>> end the node is still in bootstrap mode.
>>>>> 2. As someone suggested I increased the rpc timeout from 10k to 30k
>>>>> (RpcTimeoutInMillis) but that didn't seem to help. I did this only on the
>>>>> new node. Should I have done that on all (old) nodes as well? Or maybe
>>>>> only on the ones that were supposed to stream data to that node.
>>>>> 3. Logging level at DEBUG now but nothing interesting going on except
>>>>> for occasional messages such as [1] or [2]
>>>>>
>>>>> So the question is: what's keeping the new node from finishing the
>>>>> bootstrap and how can I check its status?
>>>>> Thanks
>>>>>
>>>>> [1] DEBUG [Timer-1] 2011-01-04 05:21:24,402 LoadDisseminator.java (line
>>>>> 36) Disseminating load info ...
>>>>> [2] DEBUG [RMI TCP Connection(22)-192.168.252.88] 2011-01-04
>>>>> 05:12:48,033 StorageService.java (line 1189) computing ranges for
>>>>> 28356863910078205288614550619314017621,
>>>>> 56713727820156410577229101238628035242,
>>>>>  85070591730234615865843651857942052863,
>>>>> 113427455640312821154458202477256070484,
>>>>> 141784319550391026443072753096570088105,
>>>>> 170141183460469231731687303715884105727
>>>>>
>>>>> --
>>>>> /Ran
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> /Ran
>>>
>>>
>>
>>
>> --
>> /Ran
>>
>>
>


-- 
/Ran
