Hi Mark,
We have updated to Riak 1.3 and raised zdbbl to 32MB but still run into
the described phenomenon. Nine of our 12 nodes suddenly drop their CPU
utilisation drastically, while three nodes still show a normal CPU load
but log some "busy_dist_port" messages in console.log (a lot fewer
since the zdbbl config change, though). The overall response times
become terrible. It happened again just now; it lasted for a few
minutes and then went away. The three nodes that keep running at normal
utilisation are not always the same.
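For reference, zdbbl is specified in kilobytes in vm.args, so 32MB
corresponds to +zdbbl 32768. A minimal sketch of how to double-check
the effective limit on a running node, from "riak attach" (assuming
the node was restarted after the vm.args change):

    %% returns the distribution buffer busy limit in bytes
    erlang:system_info(dist_buf_busy_limit).
    %% expected: 33554432 (32768 KB)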
Any further ideas?
Ingo
On 19.03.2013 15:04, Mark Phillips wrote:
Hi Ingo,
Sorry for the delay in getting back to you.
This looks symptomatic of some of the scheduler issues we fixed in 1.3.
A few of the eleveldb entries in the release notes [1] provide more
precise details. Is upgrading a possibility?
Tweaking zdbbl in your vm.args should alleviate some of the issues with
busy buffers, but upgrading is probably your best path here.
Hope that helps. Keep us posted.
Mark
[1] https://github.com/basho/riak/blob/master/RELEASE-NOTES.md
On Friday, March 15, 2013, Ingo Rockel wrote:
Hi,
we have a 12-node cluster running Riak 1.2.1 which went live a week
ago. Yesterday, suddenly from one minute to the next, put_fsm_time_95
and get_fsm_time_95 rose from under 100ms to several seconds. This
went on for about 25 minutes and then went away.
Checking the Riak logs on the nodes, I found a lot of these:
2013-03-14 17:48:06.388 [info]
<0.62.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
{suppressed,port_events,1}
2013-03-14 17:48:06.889 [info]
<0.62.0>@riak_core_sysmon_handler:handle_event:85 monitor
busy_dist_port <0.7156.1>
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,1}]
{#Port<0.9083226>,'[email protected]'}
These messages are logged all day, but only once every few minutes;
in the problematic time frame between 17:45 and 18:17 they got logged
several times every second. The node IP differs, but it seems only
three nodes were involved.
On all nodes except these three, the CPU utilisation dropped by half
during this period; on the three nodes themselves there was only a
slight drop.
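As background, these messages come from the Erlang VM's system
monitor: Riak registers a process that receives a busy_dist_port event
whenever a sender gets suspended because the distribution buffer to
another node is full. A minimal sketch of the mechanism (illustrative
only; don't run this on a production node, since setting a new system
monitor would replace Riak's own):

    %% subscribe the calling process to busy_dist_port events,
    %% then wait up to a minute for one to arrive
    erlang:system_monitor(self(), [busy_dist_port]),
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            io:format("~p suspended on ~p~n", [SusPid, Port])
    after 60000 ->
        no_events
    end.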
We are using leveldb as the storage backend. I also checked some of
the leveldb LOG files; compactions are logged there, but they occur
every few hours throughout the day.
During this time our software was quite unresponsive too, so I would
like to know what caused this and what I can do to stop it. Any ideas
or hints?
I found this:
https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/GqbaeiKCSYE
where Jon Meredith suggests raising the buffer size to get rid of the
busy buffers by adding +zdbbl 16384 to the vm.args file. Might this
help?
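In case it helps anyone else reading: the change amounts to adding a
line like the following to vm.args and restarting the node (the value
is in kilobytes, so 16384 is 16MB; the Erlang default is 1024):

    ## raise the inter-node distribution buffer busy limit (in KB)
    +zdbbl 16384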
Regards,
Ingo
--
Software Architect
Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89
[email protected]
>>> qeep: Hefferwolf
www.bluelionmobile.com
www.qeep.net
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com