I watched top on all the instances when things started to fall apart.  This is 
what I saw…

Everything was jamming along just fine.  CPU usage was about 25%, and RAM 
usage was about 25% (3 of the 7 nodes were at about 15%).

Suddenly, CPU usage spikes to over 50% and RAM usage spikes to 80-90% (and I'm 
guessing that on the nodes that crashed, it went over 100%).  This happens 
within seconds; it's not a gradual growth in RAM or CPU but a spike.  The 
nodes that survived stayed at this higher water mark for RAM (and CPU keeps 
running at the elevated level even though I killed the client).  Is there any 
easy way to figure out what the process is working on?
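
For that last question, one option might be to attach to the node and ask the 
VM which processes are holding the big heaps.  A rough sketch, run from 
"riak attach", using only the standard erlang:process_info/2 fields:

%% Pair each process with its heap size; dead processes are skipped
%% because the generator pattern doesn't match 'undefined'.
Heaps = [{Pid, HS} || Pid <- erlang:processes(),
                      {heap_size, HS} <- [erlang:process_info(Pid, heap_size)]].
%% The five largest heaps, and what those processes are doing.
Top5 = lists:sublist(lists:reverse(lists:keysort(2, Heaps)), 5).
[erlang:process_info(Pid, [initial_call, current_function,
                           message_queue_len, heap_size]) || {Pid, _} <- Top5].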

Could compaction cause this?  Basically, have I hit a point where it needs to 
merge SSTables into a higher level, and there isn't enough RAM to pull the 
data set into memory?
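
For what it's worth, my understanding is that leveldb compacts a handful of 
SSTables at a time rather than pulling a whole level into RAM, so the steadier 
memory cost is the per-vnode block cache and open-file metadata.  A 
back-of-the-envelope sketch against the app.config below (numbers are 
illustrative; the cache_size shown is just leveldb's default, if I'm reading 
the defaults right):

%% With ring_creation_size 128 across 7 nodes, each node hosts roughly
%% 18-19 vnodes, and each vnode keeps its own block cache plus
%% index/filter blocks for up to max_open_files SSTables.
{eleveldb, [
            {max_open_files, 32},
            {cache_size, 8388608}  %% 8MB per vnode -> ~150MB per node in caches alone
           ]},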

Paul



Paul Ingalls
Founder & CEO Fanzo
[email protected]
@paulingalls
http://www.linkedin.com/in/paulingalls



On Aug 5, 2013, at 2:58 PM, Paul Ingalls <[email protected]> wrote:

> Hey Kresten,
> 
> Thanks for the response!
> 
> I learned my lesson on setting bucket properties.  So all buckets currently 
> use the defaults.
> 
> here is the output from one of our nodes:
> 
> total 40
> drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
> drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
> -rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
> -rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043
> 
> another node:
> 
> total 72
> drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
> drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
> -rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
> -rw-r--r-- 1 root root 14358 Aug  5 20:56 riak_core_ring.default.20130805205650
> -rw-r--r-- 1 root root 14529 Aug  5 21:09 riak_core_ring.default.20130805210924
> -rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043
> 
> Looks like the largest number of files in that directory on any node is 5.
> 
> Paul
> 
> Paul Ingalls
> Founder & CEO Fanzo
> [email protected]
> @paulingalls
> http://www.linkedin.com/in/paulingalls
> 
> 
> 
> On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup <[email protected]> wrote:
> 
>> I'd think the large number of buckets could be the issue, especially if any 
>> bucket properties are being set, because that would cause the ring data 
>> structure to be enormous.
>> 
>> Could you provide an "ls -l" output of the riak data/ring directory?
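>> 
>> You could also check the in-memory ring size directly from "riak attach"; a 
>> quick sketch using the standard riak_core_ring_manager API:
>> 
>> %% Size in bytes of the ring as it is gossiped/persisted.
>> {ok, Ring} = riak_core_ring_manager:get_my_ring().
>> byte_size(term_to_binary(Ring)).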
>> 
>> Sent from my iPhone
>> 
>> On 05/08/2013, at 21.52, "Paul Ingalls" <[email protected]> wrote:
>> 
>> As promised in my previous email, I hit a fairly big problem over the 
>> weekend and then reproduced it this morning, and I was wondering if I could 
>> get some help.
>> 
>> Basically, I was running my code against our Riak cluster and everything was 
>> moving along just fine.  However, at some point Riak just seems to hit a 
>> wall.  I get to a certain scale of content, about 9-10 GB per node, and 
>> about half the cluster gives up the ghost.
>> 
>> I've done this twice from scratch now.  The first time, I thought maybe I 
>> was pushing too many transactions per second.  So the second time I reduced 
>> the speed and changed some settings in app.config that I thought would 
>> reduce memory usage.  Ran it again, and 24 hours later I still hit the wall 
>> (at almost the same place).
>> 
>> So I figure I'm still doing something wrong; I'm just not sure what.  I'm 
>> obviously hitting some kind of heap issue, but I'm not sure what else I can 
>> do about it.  I'm hoping there is something obvious I'm missing in my 
>> ignorance, because at the moment I'm stuck...
>> 
>> Some details:
>> 
>> 7-node cluster running Riak 1.4
>> each VM has 7GB RAM and 4 CPUs
>> the data directory is on a RAID0 volume with 750GB of space
>> 128 partitions, LevelDB backend
>> using links and secondary indexes
>> lots of buckets (over 10 million); a couple of buckets have lots of keys (one 
>> was around 6.6 million keys when it crashed, the other around 3.7 million)
>> values are pretty small, almost all just a few bytes; the exception is the 
>> largest bucket (6.6 million keys), with value sizes between 1-2k
>> 
>> primary custom app config settings:
>> {kernel,
>>           [
>>               {inet_dist_listen_min, 6000},
>>               {inet_dist_listen_max, 7999}
>>           ]},
>> 
>> {riak_core, [
>>              {default_bucket_props, [
>>                    {allow_mult, true},
>>                    {r, 1},
>>                    {w, 1},
>>                    {dw, 1}
>>                   ]},
>>              {ring_creation_size, 128}
>> ]},
>> 
>> {riak_kv, [
>>            {storage_backend, riak_kv_eleveldb_backend}
>> ]},
>> 
>> {eleveldb, [
>>             {max_open_files, 32}
>>            ]},
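>> 
>> To double-check these took effect on a running node, from "riak attach" 
>> (standard OTP application-env lookups; the expected values assume the 
>> config above):
>> 
>> application:get_env(eleveldb, max_open_files).       %% expect {ok, 32}
>> application:get_env(riak_core, ring_creation_size).  %% expect {ok, 128}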
>> 
>> custom vm.args settings:
>> +A 16
>> 
>> Here is some of the error information I am seeing when it all goes boom:
>> 
>> *************************************
>> * On one of the nodes that crashed:
>> *************************************
>> last lines of console.log
>> -------------------
>> 2013-08-05 17:59:08.071 [info] 
>> <0.29251.556>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during 
>> active anti-entropy exchange of 
>> {468137243207554840987117797979434404733540892672,3} between 
>> {479555224749202520035584085735030365824602865664,riak@riak001} and 
>> {490973206290850199084050373490626326915664838656,riak@riak002}
>> 2013-08-05 18:01:10.234 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
>> 2013-08-05 18:01:10.672 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.12133.557> 
>> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,40978360}]
>> 2013-08-05 18:01:12.993 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
>> 2013-08-05 18:01:13.816 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
>> 2013-08-05 18:01:13.819 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.12133.557> 
>> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,pack,5}},{message_queue_len,0}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,200971},{old_heap_size,0},{heap_size,47627275}]
>> 2013-08-05 18:01:14.594 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956796}]
>> 2013-08-05 18:01:15.729 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.12133.557> 
>> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,encode,2}},{message_queue_len,0}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,81},{old_heap_size,0},{heap_size,42778355}]
>> 2013-08-05 18:01:15.878 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956794}]
>> 2013-08-05 18:01:15.878 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,41},{old_heap_size,0},{heap_size,81956788}]
>> 2013-08-05 18:01:15.878 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956774}]
>> 2013-08-05 18:01:15.878 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.14832.557> 
>> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
>> 2013-08-05 18:01:15.880 [info] 
>> <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.12133.557> 
>> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{lists,reverse,1}},{message_queue_len,0}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,71},{old_heap_size,0},{heap_size,69315695}]
>> 
>> 
>> in erlang.log:
>> ----------------------
>> ===== ALIVE Mon Aug  5 17:56:26 UTC 2013
>> Erlang has closed
>> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang 
>> has closed.
>> 
>> ===== Mon Aug  5 18:13:22 UTC 2013
>> ^M
>> Crash dump was written to: ./log/erl_crash.dump^M
>> eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").^M
>> 
>> with this initial crash, nothing is written to crash.log or error.log.  
>> There is a big honking erl_crash.dump file though...
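>> 
>> For digging into the dump, OTP ships crashdump_viewer; a minimal sketch from 
>> an Erlang shell on the box (the dump is also plain text if you'd rather 
>> grep it):
>> 
>> %% Browse the processes in ./log/erl_crash.dump, including heap sizes.
>> crashdump_viewer:start().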
>> 
>> *************************************
>> * on the next node:
>> *************************************
>> last lines of console.log
>> ----------------------
>> 2013-08-05 17:49:02.789 [info] 
>> <0.3014.550>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during 
>> active anti-entropy exchange of 
>> {91343852333181432387730302044767688728495783936,3} between 
>> {91343852333181432387730302044767688728495783936,riak@riak002} and 
>> {114179815416476790484662877555959610910619729920,riak@riak004}
>> 2013-08-05 18:00:19.830 [info] <0.61.0> alarm_handler: 
>> {set,{system_memory_high_watermark,[]}}
>> 2013-08-05 18:00:56.272 [info] 
>> <0.85.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
>> <0.7460.0> 
>> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{eleveldb,get,3}},{message_queue_len,6}]
>>  
>> [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,40978885}]
>> 
>> in erlang.log:
>> ----------------------
>> ===== ALIVE Mon Aug  5 17:56:55 UTC 2013
>> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang 
>> has closed.Erlang has closed
>> 
>> 
>> ===== Mon Aug  5 18:13:33 UTC 2013
>> ^M
>> Crash dump was written to: ./log/erl_crash.dump^M
>> eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").^M
>> 
>> *************************************
>> * on the next node:
>> *************************************
>> console.log looks similar to the first
>> 
>> nothing was logged in erlang.log and there was no crash dump.  Looks like 
>> the beam proc just crashed hard.
>> 
>> 
>> *************************************
>> * on the next node:
>> *************************************
>> console.log looks similar to the first
>> 
>> in erlang.log:
>> ----------------------
>> ===== Mon Aug  5 18:01:43 UTC 2013
>> Erlang has closed
>> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang 
>> has closed.
>> ^M
>> Crash dump was written to: ./log/erl_crash.dump^M
>> eheap_alloc: Cannot allocate 2280657000 bytes of memory (of type "heap").^M
>> 
>> 
>> If I try to start things back up, the nodes keep periodically crashing, 
>> usually within a few minutes.  And I can't put any more data in without it 
>> blowing up...
>> 
>> Any help would be appreciated.
>> 
>> Paul Ingalls
>> Founder & CEO Fanzo
>> [email protected]
>> @paulingalls
>> http://www.linkedin.com/in/paulingalls
>> 
>> 
>> 
> 

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
