Hi Mark,
We have updated to Riak 1.3 and raised zdbbl to 32MB but still run into
the described phenomenon. Nine of our 12 nodes suddenly drop their CPU
utilisation drastically, while three nodes still show a normal CPU load
but log some "busy_dist_port" messages in console.log (a lot fewer
since the zdbbl config change, though). The overall response times
become terrible. It happened again just now; it lasted for a few
minutes and then went away. The three nodes that keep running at normal
utilisation are not always the same.
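For reference, zdbbl is specified in kilobytes in vm.args, so 32MB
corresponds to +zdbbl 32768. A minimal sketch of how to double-check
the effective limit on a running node, from "riak attach" (assuming
the node was restarted after the vm.args change):

    %% returns the distribution buffer busy limit in bytes
    erlang:system_info(dist_buf_busy_limit).
    %% expected: 33554432 (32768 KB)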
Any further ideas?
Ingo
On 19.03.2013 15:04, Mark Phillips wrote:
Hi Ingo,
Sorry for the delay in getting back to you.
This looks symptomatic of some of the scheduler issues we fixed in 1.3.
A few of the eleveldb entries in the release notes [1] provide more
precise details. Is upgrading a possibility?
Tweaking zdbbl in your vm.args should alleviate some of the issues with
busy buffers, but upgrading is probably your best path here.
Hope that helps. Keep us posted.
Mark
[1] https://github.com/basho/riak/blob/master/RELEASE-NOTES.md
On Friday, March 15, 2013, Ingo Rockel wrote:
Hi,
we have a 12-node cluster running Riak 1.2.1 which went live a week
ago. Yesterday, suddenly from one minute to the next, put_fsm_time_95
and get_fsm_time_95 rose from under 100ms to several seconds. This
went on for about 25 minutes and then went away.
Checking the Riak logs on the nodes, I found a lot of these:
2013-03-14 17:48:06.388 [info]
<0.62.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
{suppressed,port_events,1}
2013-03-14 17:48:06.889 [info]
<0.62.0>@riak_core_sysmon_handler:handle_event:85 monitor
busy_dist_port <0.7156.1>
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,1}]
{#Port<0.9083226>,'[email protected]'}
These messages are logged all day, but only once every few minutes;
in the problematic time frame between 17:45 and 18:17 they got logged
several times every second. The node IP differs, but it seems only
three nodes were involved.
On all nodes except these three, the CPU utilisation dropped by half
during this period; on the three nodes themselves there was only a
slight drop.
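As background, these messages come from the Erlang VM's system
monitor: Riak registers a process that receives a busy_dist_port event
whenever a sender gets suspended because the distribution buffer to
another node is full. A minimal sketch of the mechanism (illustrative
only; don't run this on a production node, since setting a new system
monitor would replace Riak's own):

    %% subscribe the calling process to busy_dist_port events,
    %% then wait up to a minute for one to arrive
    erlang:system_monitor(self(), [busy_dist_port]),
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            io:format("~p suspended on ~p~n", [SusPid, Port])
    after 60000 ->
        no_events
    end.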
We are using leveldb as the storage backend. I also checked some of
the leveldb LOG files; compactions are logged there, but they occur
every few hours throughout the day.
During this time our software was quite unresponsive too, so I would
like to know what caused this and what I can do to stop it. Any ideas
or hints?
I found this:
https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/GqbaeiKCSYE
where Jon Meredith suggests raising the buffer size to get rid of the
busy buffers by adding +zdbbl 16384 to the vm.args file. Might this
help?
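In case it helps anyone else reading: the change amounts to adding a
line like the following to vm.args and restarting the node (the value
is in kilobytes, so 16384 is 16MB; the Erlang default is 1024):

    ## raise the inter-node distribution buffer busy limit (in KB)
    +zdbbl 16384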
Regards,
Ingo
--
Software Architect
Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89
[email protected]
>>> qeep: Hefferwolf
www.bluelionmobile.com
www.qeep.net
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com