I watched top on all the instances when things started to fall apart. This is what I saw…
Everything was jamming along just fine. CPU usage was about 25%, ram usage was about 25% (3 of the 7 were at about 15%). Suddenly, CPU usage spikes to over 50% and ram usage spikes to 80-90% (and I'm guessing on those nodes that crash, it hit over 100%). This happens within seconds, its not a gradual growth in RAM or CPU but a spike. On those nodes that survived, they stayed at this higher water mark for RAM (and CPU keeps running at the incremental difference since I killed the client). Is there any easy way to figure out what the process is working on? Could compaction cause this? Basically I have hit a place where it needs to merge SSTables to a higher level, and there isn't enough RAM to pull the data set into memory? Paul Paul Ingalls Founder & CEO Fanzo [email protected] @paulingalls http://www.linkedin.com/in/paulingalls On Aug 5, 2013, at 2:58 PM, Paul Ingalls <[email protected]> wrote: > Hey Kresten, > > Thanks for the response! > > I learned my lesson on setting bucket properties. So all buckets currently > use the defaults. 
> here is the output from one of our nodes:
>
> total 40
> drwxr-xr-x 2 root root 4096 Aug 5 21:10 ./
> drwxr-xr-x 6 root root 4096 Aug 4 17:26 ../
> -rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
> -rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043
>
> another node:
>
> total 72
> drwxr-xr-x 2 root root 4096 Aug 5 21:10 ./
> drwxr-xr-x 6 root root 4096 Aug 4 17:26 ../
> -rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
> -rw-r--r-- 1 root root 14358 Aug 5 20:56 riak_core_ring.default.20130805205650
> -rw-r--r-- 1 root root 14529 Aug 5 21:09 riak_core_ring.default.20130805210924
> -rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043
>
> looks like the largest number of files in that directory on any node is 5
>
> Paul
>
> Paul Ingalls
> Founder & CEO Fanzo
> [email protected]
> @paulingalls
> http://www.linkedin.com/in/paulingalls
>
>
> On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup <[email protected]> wrote:
>
>> I'd think the large number of buckets could be the issue, especially if any
>> bucket properties are being set, because that would make the ring data
>> structure enormous.
>>
>> Could you provide an "ls -l" output of the riak data/ring directory?
>>
>> Sent from my iPhone
>>
>> On 05/08/2013, at 21.52, "Paul Ingalls" <[email protected]> wrote:
>>
>> As promised in my previous email, I hit a fairly big problem over the weekend
>> and then reproduced it this morning, and I was wondering if I could get some
>> help.
>>
>> Basically, I was running my code against our Riak cluster and everything was
>> moving along just fine. However, at some point Riak just seems to hit a
>> wall. I get to a certain scale of content, about 9-10 GB per node, and
>> about half the cluster gives up the ghost.
>> I've done this twice from scratch now. The first time I thought maybe I was
>> trying to push too many transactions per second. But the second time I
>> reduced the speed and changed some settings in the app.config that I thought
>> would reduce memory usage. Ran it again, and 24 hours later I still hit the
>> wall (at almost the same place).
>>
>> So, I figure I'm still doing something wrong, I'm just not sure what. I'm
>> obviously hitting some kind of heap issue, but I'm not sure what else I can
>> do about it. I'm hoping there is something obvious I'm missing in my
>> ignorance, because at the moment I'm stuck...
>>
>> Some details:
>>
>> 7 node cluster running 1.4
>> each VM has 7GB RAM and 4 CPUs
>> the data directory is on a RAID0 with 750GB of space
>> 128 partitions, LevelDB backend
>> using links and secondary indexes
>> lots of buckets (over 10 million); a couple of buckets have lots of keys (one
>> was around 6.6 million keys when it crashed, the other around 3.7 million).
>> Values are pretty small, almost all just a few bytes. The one exception is the
>> largest bucket (6.6 million keys), with value sizes between 1-2k.
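[Editor's note: those numbers allow a crude lower bound on eleveldb's steady-state memory per node. This is only a sketch: the ~4 MB per open file and the ~60 MB write buffer per vnode are assumed rough defaults, not measured values; max_open_files is set to 32 in the app.config quoted below.]

```python
# Back-of-envelope eleveldb memory per node. The per-open-file and
# write-buffer sizes are assumptions, not measurements from this cluster.
ring_size = 128          # partitions (from app.config below)
nodes = 7
max_open_files = 32      # per vnode (from app.config below)
mb_per_open_file = 4     # assumption: block/index data held per open file
write_buffer_mb = 60     # assumption: rough default write buffer per vnode

vnodes_per_node = ring_size / nodes                       # ~18.3 vnodes
per_vnode_mb = max_open_files * mb_per_open_file + write_buffer_mb
per_node_gb = vnodes_per_node * per_vnode_mb / 1024
print(round(per_node_gb, 1))                              # GB per node
```

Under these assumptions that is roughly half of a 7 GB VM consumed by the backend alone, before Erlang process heaps, AAE trees, and the OS page cache are counted, leaving little headroom for the heap spikes in the logs below.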
>> primary custom app.config settings:
>>
>> {kernel, [
>>   {inet_dist_listen_min, 6000},
>>   {inet_dist_listen_max, 7999}
>> ]},
>>
>> {riak_core, [
>>   {default_bucket_props, [
>>     {allow_mult, true},
>>     {r, 1},
>>     {w, 1},
>>     {dw, 1}
>>   ]},
>>   {ring_creation_size, 128}
>> ]},
>>
>> {riak_kv, [
>>   {storage_backend, riak_kv_eleveldb_backend}
>> ]},
>>
>> {eleveldb, [
>>   {max_open_files, 32}
>> ]},
>>
>> custom vm.args settings:
>>
>> +A 16
>>
>> Here is some of the error information I am seeing when it all goes boom:
>>
>> *************************************
>> * On one of the nodes that crashes:
>> *************************************
>> last lines of console.log
>> -------------------
>> 2013-08-05 17:59:08.071 [info] <0.29251.556>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {468137243207554840987117797979434404733540892672,3} between {479555224749202520035584085735030365824602865664,riak@riak001} and {490973206290850199084050373490626326915664838656,riak@riak002}
>> 2013-08-05 18:01:10.234 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
>> 2013-08-05 18:01:10.672 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,40978360}]
>> 2013-08-05 18:01:12.993 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
>> 2013-08-05 18:01:13.816 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
>> 2013-08-05 18:01:13.819 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,pack,5}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,200971},{old_heap_size,0},{heap_size,47627275}]
>> 2013-08-05 18:01:14.594 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956796}]
>> 2013-08-05 18:01:15.729 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,encode,2}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,81},{old_heap_size,0},{heap_size,42778355}]
>> 2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956794}]
>> 2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,41},{old_heap_size,0},{heap_size,81956788}]
>> 2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956774}]
>> 2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
>> 2013-08-05 18:01:15.880 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{lists,reverse,1}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,71},{old_heap_size,0},{heap_size,69315695}]
>>
>> in erlang.log:
>> ----------------------
>> ===== ALIVE Mon Aug 5 17:56:26 UTC 2013
>> Erlang has closed
>> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
>> ===== Mon Aug 5 18:13:22 UTC 2013
>>
>> Crash dump was written to: ./log/erl_crash.dump
>> eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").
>>
>> With this initial crash, nothing is written to crash.log or error.log.
>> There is a big honking erl_crash.dump file, though...
>>
>> *************************************
>> * on the next node:
>> *************************************
>> last lines of console.log
>> ----------------------
>> 2013-08-05 17:49:02.789 [info] <0.3014.550>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {91343852333181432387730302044767688728495783936,3} between {91343852333181432387730302044767688728495783936,riak@riak002} and {114179815416476790484662877555959610910619729920,riak@riak004}
>> 2013-08-05 18:00:19.830 [info] <0.61.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
>> 2013-08-05 18:00:56.272 [info] <0.85.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.7460.0> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{eleveldb,get,3}},{message_queue_len,6}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,40978885}]
>>
>> in erlang.log:
>> ----------------------
>> ===== ALIVE Mon Aug 5 17:56:55 UTC 2013
>> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
>> Erlang has closed
>>
>> ===== Mon Aug 5 18:13:33 UTC 2013
>>
>> Crash dump was written to: ./log/erl_crash.dump
>> eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").
>>
>> *************************************
>> * on the next node:
>> *************************************
>> console.log looks similar to the first
>>
>> nothing was logged in erlang.log and there was no crash dump. Looks like
>> the beam process just crashed hard.
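[Editor's note: the sizes in the large_heap monitor lines above are in machine words, and a word is 8 bytes on a 64-bit VM. Notably, the largest heap_block_size reported (116769640 words) times 8 is exactly the allocation that fails in the crash dump:]

```python
# sysmon large_heap sizes are reported in words; 8 bytes/word on 64-bit.
heap_block_words = 116_769_640   # largest heap_block_size in the logs above
failed_alloc = 934_157_120       # bytes, from "eheap_alloc: Cannot allocate ..."

assert heap_block_words * 8 == failed_alloc   # exact match
print(round(heap_block_words * 8 / 2**20))    # block size in MiB
```

So a single get/encode process heap block of roughly 900 MiB is what the VM was trying to allocate when it died.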
>> >> >> ************************************* >> * on the next node: >> ************************************* >> last lines of console.log >> ---------------------- >> console.log looks similar to the first >> >> in erlang.log: >> ---------------------- >> ===== Mon Aug 5 18:01:43 UTC 2013 >> Erlang has closed >> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang >> has closed. >> ^M >> Crash dump was written to: ./log/erl_crash.dump^M >> eheap_alloc: Cannot allocate 2280657000 bytes of memory (of type "heap").^M >> >> >> If I try to start things back up, the nodes keep periodically crashing, >> usually within a few minutes. And I can't put any more data in without it >> blowing up... >> >> Any help would be appreciated. >> >> Paul Ingalls >> Founder & CEO Fanzo >> [email protected]<mailto:[email protected]> >> @paulingalls >> http://www.linkedin.com/in/paulingalls >> >> >> >> _______________________________________________ >> riak-users mailing list >> [email protected]<mailto:[email protected]> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
