Hello Riak Folks,
For the last three days, we've been having a string of problems with Riak. An
otherwise healthy server running our full application stack will suddenly start
throwing a bunch of errors in the logs. Although the Riak processes stay up,
most or all requests to Riak fail during these periods.
The errors in the logs predominantly describe "riak_kv_vnode worker pool
crashed" and timeout conditions. This morning, the crashes started immediately
after a clean Riak startup and a single call to our API, so the logs are quite
free of other noise. I've summarized those logs below for curious parties, and
can attach the full set of logs if needed.
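For anyone who wants to produce a similar summary from the full log file, here is a rough sketch in Python. It assumes each entry sits on a single line in the lager format shown in the logs below (the entries in this message are wrapped by the mail client). The category names are hypothetical labels chosen for this tally, not anything Riak defines.

```python
import re
from collections import Counter

# Start of a lager-style entry: timestamp, [level], pid, optional @module:function:line.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] (?P<pid><[\d.]+>)(?:@(?P<where>\S+))?\s*(?P<msg>.*)$"
)

# (substring to look for, hypothetical label). The substrings are taken from the logs.
CATEGORIES = [
    ("worker pool crashed", "worker_pool_crashed"),
    ("CRASH REPORT", "crash_report"),
    ("terminated with reason", "terminated"),
]

def tally(lines):
    """Count [error] entries by category; non-error and unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("level") != "error":
            continue
        msg = m.group("msg")
        # First matching substring wins; anything else is counted as "other".
        label = next((name for needle, name in CATEGORIES if needle in msg), "other")
        counts[label] += 1
    return counts
```

Passing an open file object (or any iterable of lines) to `tally` returns a `Counter` keyed by those labels.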
I forgot to check this morning, but during a similar outage on Monday, the Riak
server was refusing connections from new clients.
Interestingly, after giving Riak a while with no traffic at all today (like
15-30 minutes), it appears to have recovered without a restart. We've had
similar recoveries during other "outages" of this type since Sunday evening.
This sort of recovery does seem to require stopping all application KV
requests for a while.
We suspect some kind of corruption in the eleveldb on-disk files, because in
past outages of this type, we've observed that the condition persists across
reboots. But we don't have much more evidence than that. Is there a command we
can run that will check eleveldb files for corruption or inconsistency?
Other than that, what can cause "worker pool crashed" events? What do you know
about the "timeouts" that are in these logs?
For the record, we're running Riak 1.2.0 on Ubuntu 10.04, eleveldb backend
with 512 partitions. We're running predominantly in a single-node configuration
on a bunch of isolated dev boxes at the moment, on our way to spreading out our
512 vnodes onto 5 hosts in production.
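As an aid to reading the partition IDs in the logs below, here is a small sketch of the arithmetic. It assumes the ring covers a 2^160 key space split evenly into 512 partitions, so each partition ID is a multiple of 2^160/512. The helper names are hypothetical, and the even split across hosts is an idealization; Riak's actual claim algorithm may assign vnodes differently.

```python
RING_TOP = 2 ** 160   # assumed size of the ring's key space
RING_SIZE = 512       # partition count from our configuration

def partition_index(partition_id, ring_size=RING_SIZE):
    """Return the 0-based position of a partition ID on the ring."""
    step = RING_TOP // ring_size
    index, remainder = divmod(partition_id, step)
    if remainder:
        raise ValueError("ID is not on a partition boundary for this ring size")
    return index

def vnodes_per_host(ring_size, hosts):
    """Idealized even split of vnodes across hosts (not Riak's claim algorithm)."""
    base, extra = divmod(ring_size, hosts)
    return [base + 1] * extra + [base] * (hosts - extra)

print(partition_index(105616329260241031198313161739262640092323250176))
print(partition_index(114179815416476790484662877555959610910619729920))
print(vnodes_per_host(512, 5))
```

Under those assumptions, the two partition IDs that appear in the logs are partitions 37 and 40 of 512, and an even spread over 5 hosts works out to 102 or 103 vnodes per host.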
Thanks for your help,
Dave
--
Dave Lowell
[email protected]
2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node
'[email protected]'
... normal startup messages ...
2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419
Wait complete for application riak_search (0 seconds)
2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510
105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool
crashed
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready
terminated with reason:
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
... 13 more "riak_kv_vnode worker pool crashed" messages...
2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> with
0 neighbours exited with reason:
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
in gen_fsm:terminate/7 line 611
2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready
terminated with reason:
{timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with reason" and
"gen_fsm <0.2989.0> in state ready terminated with reason" message pairs
2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> terminated
with reason:
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process <0.15322.0>
with 0 neighbours exited with reason:
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
in gen_server:terminate/6 line 747
2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had
child undefined started with {riak_api_pb_server,start_link,undefined} at
<0.15322.0> exit with reason
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
in context child_terminated