and the riak-users mailer-daemon should really set a "reply-to"...
-------- Original Message --------
Subject: Re: riak cluster suddenly became unresponsive
Date: Tue, 19 Mar 2013 15:40:12 +0100
From: Ingo Rockel <[email protected]>
To: Mark Phillips <[email protected]>

Hi Mark,

thanks! The 1.3 update is already planned. But we will add the zdbbl first, as we ran into the same issue again yesterday.

Ingo

On 19.03.2013 15:04, Mark Phillips wrote:
Hi Ingo,

Sorry for the delay in getting back to you.

This looks symptomatic of some of the scheduler issues we fixed in 1.3. A few of the eleveldb issues in the release notes [1] can provide precise details. Is upgrading a possibility?

Tweaking your zdbbl in vm.args should alleviate some of the issues with busy buffers, but upgrading is probably your best path here.

Hope that helps. Keep us posted.

Mark

[1] https://github.com/basho/riak/blob/master/RELEASE-NOTES.md

On Friday, March 15, 2013, Ingo Rockel wrote:

Hi,

we have a 12-node cluster running Riak 1.2.1 which went live a week ago. Yesterday, from one minute to the next, put_fsm_time_95 and get_fsm_time_95 rose from somewhere below 100 ms to several seconds. This went on for about 25 min and then went away.

Checking the Riak logs of the nodes, I find a lot of these:

2013-03-14 17:48:06.388 [info] <0.62.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,1}
2013-03-14 17:48:06.889 [info] <0.62.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.7156.1> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,1}] {#Port<0.9083226>,'[email protected]'}

These messages are logged all day, but normally only once every few minutes. In the problematic time frame between 17:45 and 18:17, they were logged several times every second. The node IP differs, but it seems only three nodes were involved. On all nodes except these three, CPU utilisation dropped by half during this period. On the three nodes there was only a slight drop.

We are using leveldb as the storage backend. I also checked some of the leveldb LOG files and there are compactions logged, but these are logged all day, every few hours.

During this time our software was quite unresponsive too, so I would like to know what was causing this and what I might do to stop it. Any ideas, hints?
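To check whether the busy_dist_port messages really cluster in one time window and on a few nodes, the log lines quoted above can be tallied per minute and per remote node. The following is only a rough sketch: the regular expression is derived from the single example line in this message, the function name count_busy_dist_port is made up for illustration, and the node names in SAMPLE are placeholders, not the real ones.

```python
import re
from collections import Counter

# Captures the timestamp (truncated to the minute) and the remote node name
# from a "monitor busy_dist_port" line shaped like the one quoted above.
# The pattern is an assumption based on that single example.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ .*"
    r"monitor busy_dist_port .*\{#Port<[\d.]+>,'([^']+)'\}"
)


def count_busy_dist_port(lines):
    """Return (events per minute, events per remote node) as Counters."""
    per_minute = Counter()
    per_node = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            per_minute[match.group(1)] += 1
            per_node[match.group(2)] += 1
    return per_minute, per_node


# Placeholder input: node names here are hypothetical.
SAMPLE = [
    "2013-03-14 17:48:06.388 [info] <0.62.0>@riak_core_sysmon_handler:"
    "handle_event:89 Monitor got {suppressed,port_events,1}",
    "2013-03-14 17:48:06.889 [info] <0.62.0>@riak_core_sysmon_handler:"
    "handle_event:85 monitor busy_dist_port <0.7156.1> "
    "[{initial_call,{riak_core_vnode,init,1}},{message_queue_len,1}] "
    "{#Port<0.9083226>,'riak@node-a.example'}",
    "2013-03-14 17:48:07.120 [info] <0.62.0>@riak_core_sysmon_handler:"
    "handle_event:85 monitor busy_dist_port <0.7201.1> "
    "[{initial_call,{riak_core_vnode,init,1}},{message_queue_len,3}] "
    "{#Port<0.9083230>,'riak@node-b.example'}",
]

per_minute, per_node = count_busy_dist_port(SAMPLE)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
for node, count in per_node.most_common():
    print(node, count)
```

Passing an open log file instead of SAMPLE works the same way, since the function only iterates over lines. The "suppressed" lines are skipped on purpose; they indicate that the monitor dropped events, so the real counts may be higher.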
I found this:

https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/GqbaeiKCSYE

where Jon Meredith suggests raising the buffer size to get rid of the busy buffers by adding +zdbbl 16384 to the vm.args file. Might this help?

Regards,

Ingo

--
Software Architect
Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89
[email protected]
qeep: Hefferwolf
www.bluelionmobile.com
www.qeep.net

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
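For reference, the vm.args change discussed in this thread would look roughly like the fragment below. The value 16384 is the one quoted from Jon Meredith's suggestion. +zdbbl sets the Erlang VM's distribution buffer busy limit in kilobytes (the Erlang default is 1024), so this raises it to 16 MB. Whether that value suits a given cluster is an assumption to verify under load, and the setting only takes effect after the node is restarted.

```
## Distribution buffer busy limit in kilobytes (Erlang default: 1024)
+zdbbl 16384
```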
