Hi Marcus,

So first question (and I apologize if I missed this): what version of Riak are you running? There was a fix in 1.2.1 that addressed an issue with LevelDB's locking logic [0] that looks similar to what you're experiencing.

As far as ulimit goes, you should be cranking that much higher. Our docs [1] (which you may have already read) have some details on how and why to set it pretty much as high as your system will allow.

Mark

[0] https://github.com/basho/riak_kv/pull/395
[1] http://docs.basho.com/riak/latest/cookbooks/Open-Files-Limit/

On Tue, Nov 20, 2012 at 2:55 PM, Marcus Baguley <[email protected]> wrote:
> Hello group
>
> We are experimenting with an app running over 3 nodes. The app is under
> fairly constant write from 2 or 3 writing threads for 6 hours per day and a
> moderate amount of read requests. We have experienced several crashes that
> we would like some explanation of - we are hoping it is some configuration
> issue with an easy fix that gives us some stability :)
>
> After a node crashes, it can often take several attempts to restart it, and
> IO and CPU go high. We have found that if we halt any reads/writes during the
> restart, then the node is more likely to come back and work through the
> high load process of the node being brought back to life.
>
> The two errors that show up in the logs are: riak_kv_vnode worker pool
> crashed... timeout
>
> 2012-11-11 06:37:16.105 UTC [error]
> <0.5037.0>@riak_core_vnode:handle_info:510
> 645115957103093866238345258191171801645001474048 riak_kv_vnode worker pool
> crashed
> {timeout,{gen_fsm,sync_send_event,[<0.5040.0>,{checkout,false,5000},5000]}}
> 2012-11-11 06:37:16.106 UTC [error]
> <0.5083.0>@riak_core_vnode:handle_info:510
>
>
> Followed or accompanied by: Resource temporarily unavailable
>
> 2012-11-11 06:37:18.015 UTC [error] <0.19495.424>@riak_kv_vnode:init:265
> Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock
> /var/lib/riak/leveldb/576608067853207791947547531657596035098629636096/LOCK:
> Resource temporarily unavailable"}
>
> I have read that ulimit should be increased in response to the above error -
> what should this be set to if our limit of 4096 is too low (is there any
> formula based on number of vnodes etc?).
>
> More log file context and our app.config is contained here:
>
> https://docs.google.com/folder/d/0B5dwJ114R8NzQTZEZ2VnMWJxQWM/edit
>
> Our configuration:
> ring_creation_size: 256
> Physical Nodes: 3
> ulimit -n 4096
>
> ubuntu 10.04
> vm.swappiness = 0
> Disk is on SAN, formatted as ext4 mounted with
> "noatime,barrier=0,data=writeback" options
> using deadline scheduler
>
> Best regards
>
> Marcus
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
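
On the "formula based on number of vnodes" question in the quoted message: the thread does not give one, and the following is not an official Basho formula. It is only a back-of-envelope sketch in Python that reads the open-files limit the current process inherits and works out how many vnodes each node carries for the quoted setup (ring_creation_size 256, 3 nodes). The `files_per_vnode` value of 50 is a purely hypothetical placeholder, as are the helper names; substitute whatever your backend settings imply.

```python
import math
import resource  # Unix-only standard library module


def vnodes_per_node(ring_size: int, nodes: int) -> int:
    """Upper bound on vnodes one node hosts if the ring is spread evenly."""
    return math.ceil(ring_size / nodes)


def rough_fd_estimate(ring_size: int, nodes: int,
                      files_per_vnode: int, overhead: int = 0) -> int:
    """Back-of-envelope descriptor count.

    files_per_vnode and overhead are guesses supplied by the caller,
    not values taken from Riak itself.
    """
    return vnodes_per_node(ring_size, nodes) * files_per_vnode + overhead


if __name__ == "__main__":
    # Limits are inherited per process, so run this as the user that starts Riak.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-files limit (soft, hard):", soft, hard)

    FILES_PER_VNODE = 50  # hypothetical placeholder, not a Riak default
    for nodes in (3, 2):  # 2 = the cluster while one of the three nodes is down
        print(nodes, "nodes:",
              vnodes_per_node(256, nodes), "vnodes per node, roughly",
              rough_fd_estimate(256, nodes, FILES_PER_VNODE), "descriptors")
```

Even with that modest placeholder, 86 vnodes per node already lands above a 4096 limit, and the figure grows when a node is down and the remaining two carry 128 vnodes each, which is consistent with the advice to set the limit as high as the system allows.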
