We have a 5-node cluster using eLevelDB (1.4.2) and 2i, and this afternoon it
started responding extremely slowly. CPU on member 4 was extremely high and we
restarted that process, but it didn’t help. We temporarily shut down member 4
and cluster speed returned to normal, but as soon as we boot member 4 back up,
the cluster performance goes to shit.
We’ve run into this before, but that was before we migrated to this bare-metal
cluster, so we were able to wipe the machines and start with a fresh set of
data. Now it is causing some pretty significant issues and we’re not sure
what we can do to get it back to normal. Many of our queues are filling up and
we’ll probably have to take node 4 offline again just so we can provide a
regular quality of service.
We’ve turned off AAE on node 4 but it hasn’t helped. We have some transfers
that need to happen but they are going very slowly.
'riak-admin top' on node 4 reports this:
Load:  cpu         610    Memory:  total       503852    binary    231544
       procs       804             processes   179850    code       11588
       runq        134             atom           533    ets         4581

Pid             Name or Initial Func  Time    Reds    Memory  MsgQ  Current Function
------------------------------------------------------------------------------------------------------------
<6175.29048.3>  proc_lib:init_p/5     '-'   462231  51356760     0  mochijson2:json_bin_is_safe/1
<6175.12281.6>  proc_lib:init_p/5     '-'   307183  64195856     1  gen_fsm:loop/7
<6175.1581.5>   proc_lib:init_p/5     '-'   286143  41085600     0  mochijson2:json_bin_is_safe/1
<6175.6659.0>   proc_lib:init_p/5     '-'   281845     13752     0  sext:decode_binary/3
<6175.6666.0>   proc_lib:init_p/5     '-'   209113     21648     0  sext:decode_binary/3
<6175.12219.6>  proc_lib:init_p/5     '-'   168832  16829200     0  riak_client:wait_for_query_results/4
<6175.8403.0>   proc_lib:init_p/5     '-'   133333     13880     1  eleveldb:iterator_move/2
<6175.8813.0>   proc_lib:init_p/5     '-'   119548      9000     1  eleveldb:iterator/3
<6175.8411.0>   proc_lib:init_p/5     '-'   115759     34472     0  riak_kv_vnode:'-result_fun_ack/2-fun-0-'
<6175.5679.0>   proc_lib:init_p/5     '-'   109577      8952     0  riak_kv_vnode:'-result_fun_ack/2-fun-0-'
Output server crashed: connection_lost
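In case it helps anyone compare this against a healthy node, here is a rough sketch (Python, not something we run in production) that sums the reductions in the top output above by the module of each process's current function. It assumes each process row sits on a single line with seven whitespace-separated fields; the name reductions_by_module is made up for illustration.

```python
from collections import defaultdict

# Process rows copied from the `riak-admin top` output above.
# Fields: pid, initial function, time, reductions, memory, msg queue, current function
ETOP_ROWS = """\
<6175.29048.3> proc_lib:init_p/5 '-' 462231 51356760 0 mochijson2:json_bin_is_safe/1
<6175.12281.6> proc_lib:init_p/5 '-' 307183 64195856 1 gen_fsm:loop/7
<6175.1581.5> proc_lib:init_p/5 '-' 286143 41085600 0 mochijson2:json_bin_is_safe/1
<6175.6659.0> proc_lib:init_p/5 '-' 281845 13752 0 sext:decode_binary/3
<6175.6666.0> proc_lib:init_p/5 '-' 209113 21648 0 sext:decode_binary/3
<6175.12219.6> proc_lib:init_p/5 '-' 168832 16829200 0 riak_client:wait_for_query_results/4
<6175.8403.0> proc_lib:init_p/5 '-' 133333 13880 1 eleveldb:iterator_move/2
<6175.8813.0> proc_lib:init_p/5 '-' 119548 9000 1 eleveldb:iterator/3
<6175.8411.0> proc_lib:init_p/5 '-' 115759 34472 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
<6175.5679.0> proc_lib:init_p/5 '-' 109577 8952 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
"""


def reductions_by_module(rows):
    """Sum the Reds column per module of the Current Function column."""
    totals = defaultdict(int)
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip headers, separators, or wrapped lines
        reds = int(fields[3])
        module = fields[6].split(":", 1)[0]
        totals[module] += reds
    return dict(totals)


totals = reductions_by_module(ETOP_ROWS)
# Print modules from most to fewest reductions.
for module, reds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{module:14} {reds}")
```

On this sample the JSON and sext decoding processes come out on top, followed by the eleveldb iterators, which is why we suspect our 2i queries.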
Based on that, is there anything anyone can think of that we could try to bring
performance back into the land of usability? Does this look like something
that may have been resolved in 1.4.6 or 1.4.7?
The only thing we can think of at this point is to remove or force-remove the
member and join a freshly built one, but the last time we attempted that (on
a different cluster) our secondary indexes got irreparably damaged and only
regained consistency when we copied every individual key to (this) new cluster!
Not a good experience :( but I’m hopeful that 1.4.6 may have addressed some of
our issues.
Any help is appreciated.
Thank you,
Sean McKibben