Thanks Matthew. Yep, there are quite a few hits on 'waiting'. Interesting. I'll send the merged log separately.
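For reference, the full sequence here is just Matthew's two commands from the thread below, plus a compression step added as one plausible way to package the merged log for sending (the tar command and the LOG.all.tgz name are assumptions, not something from the thread; Matthew only notes the merged file "will tar zip really well"):

# scan every vnode's leveldb LOG for write-stall messages
sort /var/db/riak/leveldb/*/LOG* | grep -i waiting

# on one affected node, merge all LOG files into a single file
sort /var/db/riak/leveldb/*/LOG* > LOG.all

# compress the merged log before sending (assumed step; any archiver would do)
tar czf LOG.all.tgz LOG.all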
Dave

--
Dave Lowell
[email protected]

On Nov 14, 2012, at 10:43 AM, Matthew Von-Maszewski wrote:

> Dave,
>
> Ok, heavy writes. Let's see if leveldb has hit one of its intentional "stalls":
>
>> sort /var/db/riak/leveldb/*/LOG* | grep -i waiting
>
> See if that shows any indication of a stall in the LOG files of leveldb. If so, pick one server and send me a combined LOG file from that server:
>
> sort /var/db/riak/leveldb/*/LOG* >LOG.all
>
> That will tar zip really well.
>
> Matthew
>
>
> On Nov 14, 2012, at 1:34 PM, David Lowell wrote:
>
>> Thanks Matthew. I've run both greps with no hits, unfortunately.
>>
>> A couple of details that I want to highlight. Since I first posted about this issue, we upgraded from Riak 1.2.0 to 1.2.1. Following that upgrade, we continue to see these periods of instability with errors in the logs like "riak_kv_vnode worker pool crashed", but we have also started seeing lots of new error records in the logs about "Unrecognized message". In both cases, we see tons of these "long_gc" monitoring messages and several "system_memory_high_watermark" alarms during these periods. The client also has connection problems such as timeouts and connections being refused.
>>
>> The logs from the last 18 hours on this server are really a mess, with very high levels of all of these errors. I'd be happy to send them along if you think that would help.
>>
>> It's also worth noting that our application is pretty write heavy and has lots of parallel processes generating those writes (so not a lot of flow control if Riak bogs down, at least not yet). It's probably pushing Riak fairly hard.
>>
>> Dave
>>
>> --
>> Dave Lowell
>> [email protected]
>>
>> On Nov 14, 2012, at 8:51 AM, Matthew Von-Maszewski wrote:
>>
>>> Dave,
>>>
>>> Just getting my head back into the game. Was away for a few days. Random thought: maybe there is a hard drive with a read problem. That can cause issues similar to this. 1.2.1 does NOT percolate the read errors seen in leveldb to riak-admin (yes, that should start to happen in 1.3).
>>>
>>> I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for this script. Please substitute your appropriate path from app.config and try these two commands on each physical server (node):
>>>
>>> sort /var/db/riak/leveldb/*/LOG* | grep corrupted
>>> sort /var/db/riak/leveldb/*/LOG* | grep checksum
>>>
>>> If you get hits on either, we have found the problem.
>>>
>>> The "LOCK file unavailable" message is more of a statement about the internal condition of the code than an error. It is saying that the first attempt to re-open a vnode failed because the prior instance is still closing (or, more likely, waiting for Erlang's garbage collection to finish destroying things). This message is new to the 1.2 code base.
>>>
>>> Matthew
>>>
>>>
>>> On Nov 7, 2012, at 6:56 PM, David Lowell wrote:
>>>
>>>> After further thought, I want to add more color to this issue. I hypothesize that the symptoms I described here were continued fallout from an earlier crash. So I've waded further back into the logs to try to shed light on how the Riak process was doing prior to this time. It was unhappy. It appears to have failed and stopped several times in the half day prior.
>>>> The pattern revealed in the logs looks something like this:
>>>>
>>>> - Following a clean startup, the service runs for a while
>>>>
>>>> - Usually the first log entry of any kind near the beginning of a long block of error logs is "alarm_handler: {set,{system_memory_high_watermark,[]}}" (Is this indicating excessive memory use?)
>>>>
>>>> - Within a few minutes we see several log messages warning about "long_gc", which I assume is an indication that garbage collection took longer than some threshold
>>>>
>>>> - Within the next minute or two, we start to see the legions of errors, "riak_kv_vnode worker pool crashed", and "gen_fsm" having some sort of timeout when trying to communicate with the eleveldb backend
>>>>
>>>> - Eventually we see a log record indicating "Failed to start riak_kv_eleveldb" because of a leveldb LOCK file being temporarily unavailable
>>>>
>>>> - Then Riak starts to exit: riak:stop:46 "backend module failed to start."
>>>>
>>>> So, not really knowing the Riak internals, it's a little difficult to piece together the story here. Could be we're running low on memory. Hard to know why riak_kv_workers are failing, or why this leveldb LOCK file is unavailable. To those more learned, do these log records tell a story?
>>>>
>>>> For the record, we're using the default 8 MB of leveldb cache per vnode, so that ought to cap cache for our 512 vnodes at 4 GB. Our host has 32 GB of physical memory. Are there other pieces of Riak that can use a lot of memory that we need to look out for?
>>>>
>>>> I'll include a few of the actual log records for reference, below.
>>>>
>>>> Dave
>>>>
>>>> --
>>>> Dave Lowell
>>>> [email protected]
>>>>
>>>> Representative logs, many similar ones deleted for brevity:
>>>>
>>>> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
>>>> 2012-11-07 01:55:50.517 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.18585.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}] [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
>>>> 2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool crashed {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
>>>> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
>>>> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK: Resource temporarily unavailable"}
>>>> 2012-11-07 01:57:27.425 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.19181.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
>>>> [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
>>>> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}} in gen_fsm:terminate/7 line
>>>>
>>>>
>>>> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
>>>>
>>>>> Hello Riak Folks,
>>>>>
>>>>> For the last three days, we've been having a string of problems with Riak. An otherwise healthy server running our full application stack will suddenly start throwing a bunch of errors in the logs. Although the Riak processes stay up, most or all requests to Riak fail during these periods.
>>>>>
>>>>> The errors in the logs predominantly describe "riak_kv_vnode worker pool crashed" and timeout conditions. This morning, this crashy behavior started immediately after a clean Riak startup and a single call to our API, so the logs are quite free of other noise. I've summarized those logs below for curious parties, and can attach the full set of logs if needed.
>>>>>
>>>>> I forgot to check this morning, but during a similar outage on Monday, the Riak server was refusing connections to new clients.
>>>>>
>>>>> Interestingly, after giving Riak a while with no traffic at all today (like 15-30 minutes), it appears to have recovered without a restart. We've had similar recoveries during other "outages" of this type since Sunday evening. Facilitating this sort of recovery does seem to require shutting down all application KV requests for a while.
>>>>>
>>>>> We're suspicious of some kind of corruption in the eleveldb on-disk files, because in past outages of this type we've observed that the condition persists over reboots. But we don't have much more evidence than that. Is there a command we can run that will check the eleveldb files for corruption or inconsistency?
>>>>>
>>>>> Other than that, what can cause "worker pool crashed" events? What do you know about the "timeouts" that are in these logs?
>>>>>
>>>>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, eleveldb backend with 512 partitions. We're running predominantly in a single-node configuration on a bunch of isolated dev boxes at the moment, on our way to spreading out our 512 vnodes onto 5 hosts in production.
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> Dave
>>>>>
>>>>> --
>>>>> Dave Lowell
>>>>> [email protected]
>>>>>
>>>>>
>>>>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node '[email protected]'
>>>>>
>>>>> ... normal startup messages ...
>>>>>
>>>>> 2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419 Wait complete for application riak_search (0 seconds)
>>>>> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool crashed {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>>>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>>>>
>>>>> ... 13 more "riak_kv_vnode worker pool crashed" messages ...
>>>>>
>>>>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} in gen_fsm:terminate/7 line 611
>>>>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
>>>>>
>>>>> ...
13 more "CRASH REPORT Process <X> with 0 neighbours exited with >>>>> reason" and "gen_fsm <0.2989.0> in state ready terminated with reason" >>>>> message pairs >>>>> >>>>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> >>>>> terminated with reason: >>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process >>>>> <0.15322.0> with 0 neighbours exited with reason: >>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>> in gen_server:terminate/6 line 747 >>>>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup >>>>> had child undefined started with >>>>> {riak_api_pb_server,start_link,undefined} at <0.15322.0> exit with reason >>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>> in context child_terminated >>>> >>>> _______________________________________________ >>>> riak-users mailing list >>>> [email protected] >>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >>> >> >
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
