Hello Riak Folks,

For the last three days, we've been having a string of problems with Riak. An 
otherwise healthy server running our full application stack will suddenly start 
throwing a bunch of errors in the logs. Although the Riak processes stay up, 
most or all requests to Riak fail during these periods.

The errors in the logs predominantly report "riak_kv_vnode worker pool 
crashed" and timeout conditions. This morning, the crashes started 
immediately after a clean Riak startup and a single call to our API, so the 
logs are quite free of other noise. I've summarized those logs below for 
curious parties, and can attach the full set if needed.

I forgot to check this morning, but during a similar outage on Monday, the Riak 
server was refusing connections from new clients.

Interestingly, after we gave Riak a while with no traffic at all today (about 
15-30 minutes), it appears to have recovered without a restart. We've had 
similar recoveries during other "outages" of this type since Sunday evening. 
This sort of recovery does seem to require shutting down all application KV 
requests for a while.

We're suspicious of some kind of corruption in the eleveldb on-disk files, 
because in past outages of this type, we've observed that the condition 
persists over reboots. But we don't have much more evidence than that. Is there 
a command we can run that will check over eleveldb files for corruption or 
inconsistency?

Other than that, what can cause "worker pool crashed" events? What do you know 
about the "timeouts" that are in these logs?

For the record, we're running Riak 1.2.0 on Ubuntu 10.04, with the eleveldb 
backend and 512 partitions. At the moment we're running predominantly in a 
single-node configuration on a bunch of isolated dev boxes, on our way to 
spreading our 512 vnodes across 5 hosts in production.
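
For anyone decoding the long integers in the log excerpt below: they are vnode 
partition IDs. Here is a minimal sketch of how to turn one into a partition 
number, assuming the usual Riak layout in which the 2^160 hash ring is split 
evenly across the 512 partitions. The helper name partition_index is 
hypothetical and is not part of Riak.

```python
# Assumption: the ring covers the 160-bit SHA-1 space and is divided evenly.
RING_TOP = 2 ** 160
RING_SIZE = 512  # partition count stated in this message


def partition_index(partition_id, ring_size=RING_SIZE):
    """Return the 0-based partition number for a vnode ID (hypothetical helper)."""
    increment = RING_TOP // ring_size
    index, remainder = divmod(partition_id, increment)
    if remainder != 0 or not 0 <= index < ring_size:
        raise ValueError("not a partition boundary for this ring size")
    return index


if __name__ == "__main__":
    # The two partition IDs that appear in the log excerpt.
    for pid in (
        105616329260241031198313161739262640092323250176,
        114179815416476790484662877555959610910619729920,
    ):
        print(pid, "->", partition_index(pid))
```

Under that assumption, the two IDs in the excerpt map to partitions 37 and 40, 
which makes it easier to see which vnodes are reporting crashes.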

Thanks for your help,

Dave

--
Dave Lowell
[email protected]


2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node 
'[email protected]'

... normal startup messages ...

2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419 
Wait complete for application riak_search (0 seconds)
2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 
105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool 
crashed 
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready 
terminated with reason: 
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}

... 13 more "riak_kv_vnode worker pool crashed" messages...

2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> with 
0 neighbours exited with reason: 
{timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
 in gen_fsm:terminate/7 line 611
2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready 
terminated with reason: 
{timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}

... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with reason" and 
"gen_fsm <X> in state ready terminated with reason" message pairs ...

2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> terminated 
with reason: 
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process <0.15322.0> 
with 0 neighbours exited with reason: 
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
 in gen_server:terminate/6 line 747
2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had 
child undefined started with {riak_api_pb_server,start_link,undefined} at 
<0.15322.0> exit with reason 
{error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
 in context child_terminated
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
