After further thought, I want to add more color to this issue. I hypothesize
that the symptoms I described here were continued fallout from an earlier
crash, so I've waded further back into the logs to try to shed light on how
the Riak process was doing prior to this time. It was unhappy. It appears to
have failed and stopped several times in the half day prior. The pattern
revealed in the logs looks something like this:
- Following a clean startup, the service runs for a while
- Usually the first log entry of any kind near the beginning of a long block of
error logs is "alarm_handler: {set,{system_memory_high_watermark,[]}}"
(Is this indicating excessive memory use? See the sketch just after this list.)
- Within a few minutes we see several log messages warning about "long_gc",
which I assume is an indication that garbage collection took longer than some
threshold
- Within the next minute or two, we start to see legions of errors:
"riak_kv_vnode worker pool crashed", and gen_fsm processes timing out when
trying to communicate with the eleveldb backend
- Eventually we see a log record indicating "Failed to start
riak_kv_eleveldb_backend" because a leveldb LOCK file is temporarily
unavailable
- Then Riak starts to exit: riak:stop:46 "backend module failed to start."
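In case it helps anyone sanity-check the memory theory, here's a minimal
sketch of what I plan to run from a console attached to the live node (via
riak attach) the next time the alarm fires. The alarm itself appears to come
from OS_Mon's memsup, so these calls should be available, but treat the exact
output shapes and the default threshold as my assumptions rather than facts:

    %% Run from the console attached to the live node via `riak attach`.

    %% Total VM memory use, broken down by processes, binaries, ets, etc.
    erlang:memory().

    %% What OS_Mon thinks the system-wide memory picture is; the
    %% system_memory_high_watermark alarm is raised based on this data.
    memsup:get_system_memory_data().

    %% Watermark threshold, as a percentage of system memory (the OTP
    %% default is 80%, if I remember right).
    memsup:get_sysmem_high_watermark().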
So, not really knowing the Riak internals, I find it a little difficult to
piece together the story here. It could be that we're running low on memory.
It's hard to know why the riak_kv_vnode workers are failing, or why this
leveldb LOCK file is unavailable. To those more learned: do these log records
tell a story?
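One specific thing I plan to check next time is whether some other OS process
is holding that partition's LOCK file when the "Resource temporarily
unavailable" error shows up (a second beam process, or a vnode that hasn't
let go of it yet). A quick, hypothetical check from the attached console,
assuming lsof is installed on the box:

    %% Who, if anyone, has the partition's leveldb LOCK file open?
    os:cmd("lsof /var/data/ctv/riak/leveldb/"
           "1258832464966656615093408225054454710289582522368/LOCK").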
For the record, we're using the default 8 MB of leveldb cache per vnode, so
that ought to cap cache for our 512 vnodes at 4 GB. Our host has 32 GB of
physical memory. Are there other pieces of Riak that can use a lot of memory
that we need to look out for?
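For completeness, here is roughly what the eleveldb section of our app.config
looks like; the data_root is real, the rest is from memory, so treat anything
beyond cache_size as my paraphrase rather than the exact file:

    %% app.config (excerpt)
    {eleveldb, [
        {data_root, "/var/data/ctv/riak/leveldb"},
        %% 8 MB of block cache per vnode; with 512 vnodes that's at most
        %% 512 * 8 MB = 4 GB of cache.
        {cache_size, 8388608}
    ]},

If there are other per-vnode allocations (write buffers, open-file/index
overhead, and so on) that add materially on top of that 4 GB, that's exactly
the kind of thing I'd love to hear about.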
I'll include a few of the actual log records for reference, below.
Dave
--
Dave Lowell
[email protected]
Representative logs, many similar ones deleted for brevity:
2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler:
{set,{system_memory_high_watermark,[]}}
2012-11-07 01:55:50.517 [info]
<0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.18585.32>
[{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}]
[{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510
1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool
crashed
{timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state ready
terminated with reason:
{timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed to
start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock
/var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK:
Resource temporarily unavailable"}
2012-11-07 01:57:27.425 [info]
<0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.19181.32>
[{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
[{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0>
with 0 neighbours exited with reason:
{timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
in gen_fsm:terminate/7 line
On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
> Hello Riak Folks,
>
> The last three days, we've been having a string of problems with Riak. An
> otherwise healthy server running our full application stack will suddenly
> start throwing a bunch of errors in the logs. Although the Riak processes
> stay up, most or all requests to Riak fail during these periods.
>
> The errors in the logs are predominantly describing "riak_kv_vnode worker
> pool crashed" and timeout conditions. This morning, we had this crashy
> behavior start immediately after a clean Riak startup and a single call to
> our API, so the logs are quite free of other noise. I've summarized
> those logs below for curious parties, and can attach the full set of logs if
> needed.
>
> I forgot to check this morning, but during a similar outage on Monday, the
> Riak server was refusing connections to new clients.
>
> Interestingly, after giving Riak a while with no traffic at all today (like
> 15-30 minutes), it appears to have recovered without a restart. We've had
> similar recoveries during other "outages" of this type since Sunday evening.
> Facilitating this sort of recovery does seem to require shutting down all
> application KV requests for a while.
>
> We're suspicious of some kind of corruption in the eleveldb on-disk files,
> because in past outages of this type, we've observed that the condition
> persists over reboots. But we don't have much more evidence than that. Is
> there a command we can run that will check over eleveldb files for corruption
> or inconsistency?
>
> Other than that, what can cause "worker pool crashed" events? What do you
> know about the "timeouts" that are in these logs?
>
> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, eleveldb backend
> with 512 partitions. We're running predominantly in a single-node
> configuration on a bunch of isolated dev boxes at the moment, on our way to
> spreading out our 512 vnodes onto 5 hosts in production.
>
> Thanks for your help,
>
> Dave
>
> --
> Dave Lowell
> [email protected]
>
>
> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node
> '[email protected]'
>
> ... normal startup messages ...
>
> 2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419
> Wait complete for application riak_search (0 seconds)
> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510
> 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool
> crashed
> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready
> terminated with reason:
> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>
> ... 13 more "riak_kv_vnode worker pool crashed" messages...
>
> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
> in gen_fsm:terminate/7 line 611
> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready
> terminated with reason:
> {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
>
> ... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with reason"
> and "gen_fsm <0.2989.0> in state ready terminated with reason" message pairs
>
> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> terminated
> with reason:
> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process <0.15322.0>
> with 0 neighbours exited with reason:
> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
> in gen_server:terminate/6 line 747
> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had
> child undefined started with {riak_api_pb_server,start_link,undefined} at
> <0.15322.0> exit with reason
> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
> in context child_terminated