Re: Potential problem with 0.5 branch (Possibly in gossiping?)

Ramzi Rabah Tue, 22 Dec 2009 19:21:44 -0800

Should I log a bug for this? It looks pretty serious if restarting a
node causes these kinds of fluctuations on it.


On Tue, Dec 22, 2009 at 7:12 PM, Ramzi Rabah <[email protected]> wrote:
> Watching it for a little longer, it went back up to 230, where it
> settled for a few minutes, and now it has dropped back to 0. Very
> strange.
>
> On Tue, Dec 22, 2009 at 7:01 PM, Ramzi Rabah <[email protected]> wrote:
>> Hi Jaakko, thanks for your response.
>>
>> I compiled the very latest from the 0.5 branch yesterday (whatever
>> yesterday night's build was). I do see that Node X.X.X.X is dead, and
>> Node X.X.X.X has restarted.
>>
>> This shows up on all 3 of the other servers:
>>  INFO [Timer-1] 2009-12-22 20:38:43,738 Gossiper.java (line 194)
>> InetAddress /10.6.168.20 is now dead.
>>
>> Node /10.6.168.20 has restarted, now UP again
>>  INFO [GMFD:1] 2009-12-22 20:43:12,812 StorageService.java (line 475)
>> Node /10.6.168.20 state jump to normal
>>
>> This time, the node seemed fine after the first restart, but after
>> the second restart, this is what cfstats is showing for traffic on
>> it:
>>
>>                Column Family: Datastore
>>                Memtable Columns Count: 407
>>                Memtable Data Size: 42268
>>                Memtable Switch Count: 1
>>                Read Count: 0
>>                Read Latency: NaN ms.
>>                Write Count: 0
>>                Write Latency: NaN ms.
>>                Pending Tasks: 0
>>
>> and then it went up and now it's back to:
>>
>>                Column Family: Datastore
>>                Memtable Columns Count: 2331
>>                Memtable Data Size: 242364
>>                Memtable Switch Count: 1
>>                Read Count: 107
>>                Read Latency: 0.486 ms.
>>                Write Count: 113
>>                Write Latency: 0.000 ms.
>>                Pending Tasks: 0
>>
>> which is half the traffic the other nodes are showing. The other 3
>> nodes are showing a consistent ~230 reads/writes per second, which
>> node 4 was showing before it was restarted. I hope data is not being
>> lost in the process?
>>
>>
>> On Tue, Dec 22, 2009 at 4:43 PM, Jaakko <[email protected]> wrote:
>>> Hi,
>>>
>>> Which revision number are you running?
>>>
>>> Can you see any log lines related to the node being UP or dead? (like
>>> "InetAddress X.X.X.X is now dead" or "Node X.X.X.X has restarted, now
>>> UP again"). These messages come from the Gossiper and indicate whether
>>> it thinks, for some reason, that the node is dead. These messages are
>>> logged at info level.
>>>
>>> Another thing: can you see any log messages like "Node X.X.X.X
>>> state normal, token XXX"? These are logged at debug level.
>>>
>>> -Jaakko
>>>
>>>
>>> On Wed, Dec 23, 2009 at 12:59 AM, Ramzi Rabah <[email protected]> wrote:
>>>> I recently upgraded to the latest in the 0.5 branch, and I am running
>>>> into a serious issue. I have a cluster with 4 nodes, rackunaware
>>>> strategy, and using my own tokens distributed evenly over the hash
>>>> space. I am reading from and writing to all of them at an equal rate
>>>> of about 230 reads/writes per second (and cfstats shows that). The
>>>> first 3 nodes are seeds, the last one isn't. When I start all the
>>>> nodes at the same time, they all receive equal amounts of
>>>> reads/writes (about 230).
>>>> When I bring node 4 down and bring it back up again, node 4's load
>>>> fluctuates between the 230 it used to get and sometimes no traffic at
>>>> all. The other 3 still have the same amount of traffic, and no errors
>>>> whatsoever are seen in the logs. Any ideas what could be causing this
>>>> fluctuation on node 4 after I restarted it?
>>>>
>>>
>>
>
