Fwd: Re: riak cluster suddenly became unresponsive

Ingo Rockel Tue, 19 Mar 2013 07:42:43 -0700

and the riak-users mailer-daemon should really set a "reply-to"...


-------- Original Message --------
Subject: Re: riak cluster suddenly became unresponsive
Date: Tue, 19 Mar 2013 15:40:12 +0100
From: Ingo Rockel <[email protected]>
To: Mark Phillips <[email protected]>

Hi Mark,

thanks!

The 1.3 update is already planned.

But we will add the zdbbl setting first, as we ran into the same issue
again yesterday.

Ingo

On 19.03.2013 15:04, Mark Phillips wrote:

Hi Ingo,

Sorry for the delay in getting back to you.

This looks symptomatic of some of the scheduler issues we fixed in 1.3.
A few of the eleveldb issues in the release notes [1] can provide
precise details. Is upgrading a possibility?

Tweaking your zdbbl in vm.args should alleviate some of the issues with
busy buffers but upgrading is probably your best path here.

Hope that helps. Keep us posted.

Mark

[1] https://github.com/basho/riak/blob/master/RELEASE-NOTES.md

On Friday, March 15, 2013, Ingo Rockel wrote:

    Hi,

    we have a 12-node cluster running riak 1.2.1 which went live a week
    ago. Yesterday, suddenly, from one minute to the next,
    put_fsm_time_95 and get_fsm_time_95 rose from below 100ms to
    several seconds. This went on for about 25 min and then went away.

    Checking the riak-logs of the nodes, I find a lot of these:

    2013-03-14 17:48:06.388 [info]
    <0.62.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
    {suppressed,port_events,1}
    2013-03-14 17:48:06.889 [info]
    <0.62.0>@riak_core_sysmon_handler:handle_event:85 monitor
    busy_dist_port <0.7156.1>
    [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,1}]
    {#Port<0.9083226>,'[email protected]'}

    These messages are logged all day, but only once every few minutes.
    In the problematic time frame between 17:45 and 18:17, however,
    they were logged several times every second. The node IP differs,
    but it seems only three nodes were involved.
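To put numbers on "once every few minutes" versus "several times every second", the busy_dist_port entries can be counted per minute. The following is a minimal Python sketch, not part of the original thread; the function name `busy_dist_port_per_minute` is made up for illustration, and it assumes each log entry starts with a timestamp in the format shown above. It also tolerates entries that wrap over several lines, as they do in this mail.

```python
import re
from collections import Counter

# Start of a log entry, e.g. "2013-03-14 17:48:06.889 [info]".
# The capture group keeps the timestamp truncated to the minute.
TIMESTAMP = re.compile(
    r"^[ \t]*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} \[\w+\]"
)


def busy_dist_port_per_minute(lines):
    """Count busy_dist_port events per minute.

    An entry may be wrapped over several lines, so the most recent
    timestamp is remembered and each busy_dist_port match is
    attributed to it.
    """
    counts = Counter()
    current_minute = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_minute = match.group(1)
        if "busy_dist_port" in line and current_minute is not None:
            counts[current_minute] += 1
    return counts
```

Feeding it the lines of a node's log file (for example via `open(path)`, where `path` is wherever your installation writes its console log) and printing `sorted(counts.items())` should make a burst like the 17:45 to 18:17 window stand out against the normal background rate.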

    Except for these three nodes, the CPU utilisation dropped by half on
    all other nodes during this period. On the three nodes there was
    only a slight drop.

    We are using leveldb as the storage backend. I also checked some of
    the LOG files of leveldb and there are compactions logged, but these
    are logged every few hours throughout the day.

    During this time our software was quite unresponsive too, so I would
    like to know what was causing this and what I might do to stop it.
    Any ideas or hints?

    I found this:

    
    https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/GqbaeiKCSYE

    where Jon Meredith suggests raising the buffer size to get rid of
    the busy buffers by adding +zdbbl 16384 to the vm.args file. Might
    this help?
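For a 12-node cluster, editing vm.args by hand on every node is error-prone, so the change can be scripted. The following is a minimal Python sketch, not from the original thread; the helper name `set_zdbbl` is hypothetical. It only rewrites the text of a vm.args file so that it contains a single `+zdbbl` line. As far as the Erlang `erl` documentation goes, the value is given in kilobytes (so 16384 is 16 MB) with a valid range of 1 to 2097151; treat those limits as an assumption to verify for your Erlang release.

```python
import re

# An active (not commented-out) "+zdbbl <number>" line.
ZDBBL_LINE = re.compile(r"^[ \t]*\+zdbbl[ \t]+\d+[ \t]*$", re.MULTILINE)


def set_zdbbl(vm_args_text, kilobytes):
    """Return vm.args content with the +zdbbl flag set to `kilobytes`.

    An existing +zdbbl line is replaced in place; otherwise the flag
    is appended as a new line at the end.
    """
    # Range taken from the erl documentation (assumption, see lead-in).
    if not 1 <= kilobytes <= 2097151:
        raise ValueError("zdbbl must be between 1 and 2097151 kilobytes")
    new_line = f"+zdbbl {kilobytes}"
    if ZDBBL_LINE.search(vm_args_text):
        return ZDBBL_LINE.sub(new_line, vm_args_text)
    # Make sure the appended flag starts on its own line.
    separator = "" if vm_args_text == "" or vm_args_text.endswith("\n") else "\n"
    return vm_args_text + separator + new_line + "\n"
```

The function works on a string rather than a file so that the result can be reviewed before it is written back. The node has to be restarted for a changed vm.args to take effect.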

    Regards,

             Ingo
    --
    Software Architect

    Blue Lion mobile GmbH
    Tel. +49 (0) 221 788 797 14
    Fax. +49 (0) 221 788 797 19
    Mob. +49 (0) 176 24 87 30 89

    [email protected]
     >>> qeep: Hefferwolf

    www.bluelionmobile.com <http://www.bluelionmobile.com>
    www.qeep.net <http://www.qeep.net>

    _________________________________________________
    riak-users mailing list
    [email protected]
    http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com






