Single node causing cluster to be extremely slow (leveldb)

Sean McKibben Thu, 09 Jan 2014 18:34:29 -0800

We have a 5-node cluster using eLevelDB (1.4.2) and 2i, and this afternoon it 
started responding extremely slowly. CPU on member 4 was extremely high, so we 
restarted that process, but it didn’t help. We temporarily shut down member 4 
and cluster speed returned to normal, but as soon as we boot member 4 back up, 
cluster performance degrades badly again.


We’ve run into this before, but back then we could just wipe the machines and 
start with a fresh set of data, since it happened before we migrated to this 
bare-metal cluster. Now it is causing some pretty significant issues and we’re 
not sure what we can do to get it back to normal. Many of our queues are 
filling up, and we’ll probably have to take node 4 offline again just so we can 
provide a regular quality of service.

We’ve turned off AAE on node 4 but it hasn’t helped. We have some transfers 
that need to happen but they are going very slowly.

'riak-admin top' on node 4 reports this:
 Load:  cpu       610               Memory:  total      503852    binary     231544
        procs     804                        processes  179850    code        11588
        runq      134                        atom          533    ets          4581

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6175.29048.3>      proc_lib:init_p/5             '-'     462231   51356760          0 mochijson2:json_bin_is_safe/1
<6175.12281.6>      proc_lib:init_p/5             '-'     307183   64195856          1 gen_fsm:loop/7
<6175.1581.5>       proc_lib:init_p/5             '-'     286143   41085600          0 mochijson2:json_bin_is_safe/1
<6175.6659.0>       proc_lib:init_p/5             '-'     281845      13752          0 sext:decode_binary/3
<6175.6666.0>       proc_lib:init_p/5             '-'     209113      21648          0 sext:decode_binary/3
<6175.12219.6>      proc_lib:init_p/5             '-'     168832   16829200          0 riak_client:wait_for_query_results/4
<6175.8403.0>       proc_lib:init_p/5             '-'     133333      13880          1 eleveldb:iterator_move/2
<6175.8813.0>       proc_lib:init_p/5             '-'     119548       9000          1 eleveldb:iterator/3
<6175.8411.0>       proc_lib:init_p/5             '-'     115759      34472          0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
<6175.5679.0>       proc_lib:init_p/5             '-'     109577       8952          0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
Output server crashed: connection_lost
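
To make the process table easier to read at a glance, here is a minimal Python 
sketch that totals the reductions (Reds) per Erlang module. The condensed 
three-column input (pid, reductions, current function) and the helper name 
reds_by_module are assumptions made for illustration; riak-admin does not 
produce this format itself, the values were copied by hand from the table above.

```python
from collections import defaultdict

# Rows copied by hand from the 'riak-admin top' table in this post:
# pid, reductions, current function. This condensed layout is an
# assumption for illustration, not a format riak-admin emits.
ROWS = """\
<6175.29048.3> 462231 mochijson2:json_bin_is_safe/1
<6175.12281.6> 307183 gen_fsm:loop/7
<6175.1581.5> 286143 mochijson2:json_bin_is_safe/1
<6175.6659.0> 281845 sext:decode_binary/3
<6175.6666.0> 209113 sext:decode_binary/3
<6175.12219.6> 168832 riak_client:wait_for_query_results/4
<6175.8403.0> 133333 eleveldb:iterator_move/2
<6175.8813.0> 119548 eleveldb:iterator/3
<6175.8411.0> 115759 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
<6175.5679.0> 109577 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
"""


def reds_by_module(rows):
    """Sum reductions per module (the part before ':' in the current function)."""
    totals = defaultdict(int)
    for line in rows.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _pid, reds, func = parts
        module = func.split(":", 1)[0]
        totals[module] += int(reds)
    return dict(totals)


if __name__ == "__main__":
    totals = reds_by_module(ROWS)
    # Print modules from busiest to least busy.
    for module, reds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{module:15s} {reds:>8d}")
```

On these rows the busiest modules are mochijson2, sext and eleveldb, which may 
suggest that JSON handling and LevelDB iteration (possibly 2i queries) account 
for most of the work on node 4, though this is only a snapshot of ten processes.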

Based on that, is there anything anyone can think of to bring performance back 
into the land of usability? Does this look like something that may have been 
resolved in 1.4.6 or 1.4.7?

The only thing we can think of at this point is to remove or force-remove the 
member and join a freshly built one, but the last time we attempted that (on 
a different cluster) our secondary indexes got irreparably damaged and only 
regained consistency when we copied every individual key to (this) new cluster! 
Not a good experience :( but I’m hopeful that 1.4.6 may have addressed some of 
our issues.

Any help is appreciated.

Thank you,
Sean McKibben


_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
