Thanks to some help from Aphyr and Sean Cribbs on IRC, we narrowed the issue
down to several documents of multiple hundreds of megabytes each and one
1.1 GB document. Deleting those documents has kept the cluster running
quite happily for 3+ hours now, where before nodes were crashing after
15 minutes.
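To keep documents this large from being written again, a client-side size check before each write is one option. The following is only a minimal sketch: the 50 MB limit and the names `MAX_DOC_BYTES` and `check_document_size` are assumptions for illustration, not Riak settings or part of any Riak client.

```python
# Hypothetical client-side limit; 50 MB is an assumption, not a Riak setting.
MAX_DOC_BYTES = 50 * 1024 * 1024


def check_document_size(payload: bytes, limit: int = MAX_DOC_BYTES) -> int:
    """Return the payload size in bytes, or raise if it exceeds the limit."""
    size = len(payload)
    if size > limit:
        raise ValueError(f"document is {size} bytes, limit is {limit}")
    return size
```

Calling this right before the write means an oversized document is rejected in the application rather than stored in the cluster.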
I've managed to delete most of the large documents, but there are still a
handful (3) that I am unable to delete. Attempts to curl -X DELETE them
result in a 503 error from Riak:
< HTTP/1.1 503 Service Unavailable
< Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
< Date: Wed, 06 Jul 2011 04:20:15 GMT
< Content-Type: text/plain
< Content-Length: 18
<
request timed out
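For scripting the same DELETE against the remaining keys, here is a rough Python sketch using only the standard library. It assumes the HTTP interface serves objects at /riak/<bucket>/<key>; the base URL, bucket, key, and the 120-second client timeout are all placeholders. It will not fix the server-side timeout behind the 503, it only reports the status code for each attempt.

```python
import urllib.error
import urllib.parse
import urllib.request


def object_url(base_url, bucket, key):
    """Build a /riak/<bucket>/<key> URL, percent-encoding bucket and key."""
    def quote(part):
        return urllib.parse.quote(part, safe="")
    return f"{base_url.rstrip('/')}/riak/{quote(bucket)}/{quote(key)}"


def delete_object(base_url, bucket, key, timeout=120.0):
    """Send one DELETE and return the HTTP status code (e.g. 204, 404, 503).

    The timeout is the client-side socket timeout in seconds (an assumed
    value). Connection failures and client-side timeouts are not caught
    here and will raise.
    """
    req = urllib.request.Request(object_url(base_url, bucket, key),
                                 method="DELETE")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, including the 503 shown above, land here.
        return err.code
```

Something like `delete_object("http://127.0.0.1:8098", "mybucket", "mykey")` (hypothetical host, bucket, and key) would then return 503 for the documents described above.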
In the erlang.log, I see this right before the timeout comes back:
=INFO REPORT==== 5-Jul-2011::21:26:35 ===
[{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
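As far as I understand, this alarm comes from memsup in Erlang's os_mon application and is set when a single Erlang process holds more than a configured share of system memory; the pid in the tuple identifies that process. To see how often it fires and for which pids, a small sketch like the one below can scan the log text. The regex only covers the `set` form shown above, and the function name `find_set_alarms` is made up for this example.

```python
import re

# Matches entries like:
#   [{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
# Only the "set" form seen in the log above is handled.
ALARM_SET = re.compile(
    r"\{alarm_handler,\{set,\{(\w+),(<\d+\.\d+\.\d+>)\}\}\}"
)


def find_set_alarms(log_text):
    """Return a list of (alarm_name, pid) tuples for every alarm that was set."""
    return ALARM_SET.findall(log_text)
```

Feeding it the contents of erlang.log gives one tuple per alarm, which makes it easy to check whether the alarms line up with the failed DELETE attempts.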
Anyone have any help/ideas on what's going on here and how to fix it?
On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard <[email protected]> wrote:
> Over the last few days we've had random nodes in our 5-node cluster crash
> with "eheap_alloc: Cannot allocate xxxx bytes of memory" errors in the
> erl_crash.dump file. In general, the crashes seem to happen while trying to
> allocate 13-20 gigs of memory (our boxes have 32 gigs total). As far as I
> can tell, crashing doesn't seem to coincide with any particular requests to
> Riak. I've tried to make some sense of the erl_crash.dump file but haven't
> had any luck. I'm also in the process of restoring our Riak backups to our
> staging cluster in hopes of more accurately reproducing the issue in a less
> noisy environment.
>
> My questions for the list are:
>
> 1. Any clue how to further diagnose the issue? I can attach my
> erl_crash.dump if needed.
> 2. Is it possible/likely this is due to large m/r requests? We have a
> couple m/r requests. One goes over no more than 4 documents at a time,
> while the other goes over anywhere between 60 and 10,000 documents, though
> more towards the smaller number of documents. We use 16 JS VMs with max
> memory for the VM and stack of 32 MB each.
> 3. We're running riak 0.14.1. Would upgrading to 0.14.2 help?
>
> Thanks!
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com