The ring state looks OK; the ring does not appear to be polluted with random state. The strange thing is why the get_fsm process <0.83.0> has a +100M heap. It would be interesting to figure out what is on that heap, which you can learn from the crash dump.
Perhaps you can load the crash dump into the [CrashDumpViewer](http://www.erlang.org/doc/apps/observer/crashdump_ug.html) (start it by typing crashdump_viewer:start() in a fresh Erlang VM), then go to the process named <0.83.0> and get a stack dump. That could give a hint as to what is using all that memory.

Kresten
Mobile: +45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C | Phone: +45 8732 8787 | www.trifork.com

On Aug 5, 2013, at 11:58 PM, Paul Ingalls <[email protected]> wrote:

Hey Kresten,

Thanks for the response! I learned my lesson on setting bucket properties, so all buckets currently use the defaults.

Here is the output from one of our nodes:

total 40
drwxr-xr-x 2 root root  4096 Aug 5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug 4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043

Another node:

total 72
drwxr-xr-x 2 root root  4096 Aug 5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug 4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14358 Aug 5 20:56 riak_core_ring.default.20130805205650
-rw-r--r-- 1 root root 14529 Aug 5 21:09 riak_core_ring.default.20130805210924
-rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043

It looks like the largest number of files in that directory on any node is 5.

Paul

Paul Ingalls
Founder & CEO Fanzo
[email protected]
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup <[email protected]> wrote:

I'd think the large number of buckets could be the issue, especially if any bucket properties are being set, because that would cause the ring data structure to be enormous.
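As an alternative to the GUI viewer: erl_crash.dump is a plain-text file whose process sections begin with a "=proc:<pid>" header line, so you can extract the section for a single process from the shell. A minimal sketch (assuming the standard crash-dump section markers; the file path is whatever your node wrote, here ./log/erl_crash.dump):

```shell
# Each process section in erl_crash.dump starts with "=proc:<pid>" and
# runs until the next "=" header line. Print just <0.83.0>'s section:
awk '/^=proc:<0\.83\.0>/ {f=1}
     /^=/ && !/^=proc:<0\.83\.0>/ {f=0}
     f' erl_crash.dump
```

The heap-related fields in that section are reported in words, so multiply by the word size (8 on a 64-bit VM) to get bytes.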
Could you provide an "ls -l" output of the riak data/ring directory?

Sent from my iPhone

On 05/08/2013, at 21.52, "Paul Ingalls" <[email protected]> wrote:

As promised in my previous email, I hit a fairly big problem over the weekend and then reproduced it this morning, and I was wondering if I could get some help.

Basically, I was running my code against our Riak cluster and everything was moving along just fine. However, at some point Riak just seems to hit a wall. I get to a certain scale of content, about 9-10 GB per node, and about half the cluster gives up the ghost. I've done this twice from scratch now. The first time I thought maybe I was pushing too many transactions per second, but the second time I reduced the speed and changed some settings in the app.config that I thought would reduce memory usage. Ran it again, and 24 hours later I still hit the wall (in almost the same place). So I figure I'm still doing something wrong, I'm just not sure what. I'm obviously hitting some kind of heap issue, but I'm not sure what else I can do about it. I'm hoping there is something obvious I'm missing in my ignorance, because at the moment I'm stuck...

Some details:
- 7 node cluster running 1.4
- each VM has 7GB RAM and 4 CPUs
- the data directory is on a RAID0 with 750GB of space
- 128 partitions, LevelDB backend
- using links and secondary indexes
- lots of buckets (over 10 million); a couple of buckets have lots of keys (one was around 6.6 million keys when it crashed, the other around 3.7 million)
- values are pretty small, almost all just a few bytes; the one exception is the largest bucket (6.6 million keys), with value sizes between 1-2k
Primary custom app.config settings:

{kernel, [
  {inet_dist_listen_min, 6000},
  {inet_dist_listen_max, 7999}
]},
{riak_core, [
  {default_bucket_props, [
    {allow_mult, true},
    {r, 1},
    {w, 1},
    {dw, 1}
  ]},
  {ring_creation_size, 128}
]},
{riak_kv, [
  {storage_backend, riak_kv_eleveldb_backend}
]},
{eleveldb, [
  {max_open_files, 32}
]},

Custom vm.args settings:

+A 16

Here is some of the error information I am seeing when it all goes boom:

*************************************
* On one of the nodes that crashed:
*************************************

Last lines of console.log:
-------------------
2013-08-05 17:59:08.071 [info] <0.29251.556>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {468137243207554840987117797979434404733540892672,3} between {479555224749202520035584085735030365824602865664,riak@riak001} and {490973206290850199084050373490626326915664838656,riak@riak002}
2013-08-05 18:01:10.234 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
2013-08-05 18:01:10.672 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,40978360}]
2013-08-05 18:01:12.993 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.816 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.819 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,pack,5}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,200971},{old_heap_size,0},{heap_size,47627275}]
2013-08-05 18:01:14.594 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956796}]
2013-08-05 18:01:15.729 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,encode,2}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,81},{old_heap_size,0},{heap_size,42778355}]
2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,49},{old_heap_size,0},{heap_size,81956794}]
2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,41},{old_heap_size,0},{heap_size,81956788}]
2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956774}]
2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
2013-08-05 18:01:15.880 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{lists,reverse,1}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,74732575},{mbuf_size,0},{stack_size,71},{old_heap_size,0},{heap_size,69315695}]

In erlang.log:
----------------------
===== ALIVE Mon Aug 5 17:56:26 UTC 2013
Erlang has closed
/home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
===== Mon Aug 5 18:13:22 UTC 2013
Crash dump was written to: ./log/erl_crash.dump
eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").

With this initial crash, nothing is written to crash.log or error.log. There is a big honking erl_crash.dump file, though...
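One note on reading those sysmon numbers: heap_size and heap_block_size are reported in machine words, not bytes, so on a 64-bit VM the largest entries above correspond to hundreds of megabytes for a single process. A quick sanity check on the biggest heap_size reported:

```shell
# heap_size is in words; on a 64-bit Erlang VM a word is 8 bytes.
# Convert the largest heap_size above (81956796 words) to megabytes:
echo $(( 81956796 * 8 / 1024 / 1024 ))   # prints 625 (~625 MB)
```

That is a single riak_kv_get_fsm process, which lines up with the eheap_alloc failures below when a few of these run at once on a 7GB box.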
*************************************
* On the next node:
*************************************

Last lines of console.log:
----------------------
2013-08-05 17:49:02.789 [info] <0.3014.550>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {91343852333181432387730302044767688728495783936,3} between {91343852333181432387730302044767688728495783936,riak@riak002} and {114179815416476790484662877555959610910619729920,riak@riak004}
2013-08-05 18:00:19.830 [info] <0.61.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
2013-08-05 18:00:56.272 [info] <0.85.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.7460.0> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{eleveldb,get,3}},{message_queue_len,6}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,40978885}]

In erlang.log:
----------------------
===== ALIVE Mon Aug 5 17:56:55 UTC 2013
/home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
===== Mon Aug 5 18:13:33 UTC 2013
Crash dump was written to: ./log/erl_crash.dump
eheap_alloc: Cannot allocate 934157120 bytes of memory (of type "heap").

*************************************
* On the next node:
*************************************

console.log looks similar to the first. Nothing was logged in erlang.log, and there was no crash dump. It looks like the beam process just crashed hard.

*************************************
* On the next node:
*************************************

console.log looks similar to the first.

In erlang.log:
----------------------
===== Mon Aug 5 18:01:43 UTC 2013
Erlang has closed
/home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Crash dump was written to: ./log/erl_crash.dump
eheap_alloc: Cannot allocate 2280657000 bytes of memory (of type "heap").

If I try to start things back up, the nodes keep periodically crashing, usually within a few minutes. And I can't put any more data in without it blowing up...

Any help would be appreciated.

Paul Ingalls
Founder & CEO Fanzo
[email protected]
@paulingalls
http://www.linkedin.com/in/paulingalls

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
