Hmm. Thanks for the extensive info. We're looking into this. Give us a few hours to do some theorizing. You'll hear from us tomorrow unless another enterprising list member has thoughts here.
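In the meantime, one data point that would help us correlate the
system_memory_high_watermark alarm with real usage: a snapshot of the Erlang
VM's memory during one of these episodes. From `riak attach` (this is plain
Erlang introspection, nothing Riak-specific):

    %% Total VM memory, broken down by category (processes, ets, binary, ...):
    erlang:memory().

    %% The five largest processes by total heap size (in words), in case
    %% one process is ballooning:
    lists:sublist(
        lists:reverse(lists:keysort(2,
            [{Pid, Words} || Pid <- erlang:processes(),
                             {total_heap_size, Words} <-
                                 [erlang:process_info(Pid, total_heap_size)]])),
        5).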
On Nov 7, 2012, at 3:56 PM, David Lowell <[email protected]> wrote:

> After further thought, I want to add more color to this issue. I
> hypothesize that the symptoms I described here were continued fallout of
> an earlier crash. So I've waded further back into the logs to try to shed
> light on how the Riak process was doing prior to this time. It was
> unhappy. It appears to have failed and stopped several times in the half
> day prior. The pattern revealed in the logs looks something like this:
>
> - Following a clean startup, the service runs for a while
>
> - Usually the first log entry of any kind near the beginning of a long
>   block of error logs is "alarm_handler:
>   {set,{system_memory_high_watermark,[]}}" (Is this indicating excessive
>   memory use? See the config sketch at the end of this mail.)
>
> - Within a few minutes we see several log messages warning about
>   "long_gc", which I assume is an indication that garbage collection took
>   longer than some threshold
>
> - Within the next minute or two, we start to see the legions of errors:
>   "riak_kv_vnode worker pool crashed", and "gen_fsm" hitting some sort of
>   timeout when trying to communicate with the eleveldb backend
>
> - Eventually we see a log record indicating "Failed to start
>   riak_kv_eleveldb_backend" because a leveldb LOCK file is temporarily
>   unavailable
>
> - Then Riak starts to exit: riak:stop:46 "backend module failed to start."
>
> So, not really knowing the Riak internals, it's a little difficult to
> piece together the story here. It could be that we're running low on
> memory. It's hard to know why the riak_kv_vnode workers are failing, or
> why this leveldb LOCK file is unavailable. To those more learned, do
> these log records tell a story?
>
> For the record, we're using the default 8 MB of leveldb cache per vnode,
> so that ought to cap cache use for our 512 vnodes at 4 GB. Our host has
> 32 GB of physical memory. Are there other pieces of Riak that can use a
> lot of memory that we need to look out for?
>
> I'll include a few of the actual log records for reference, below.
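>
> For reference, here is how I understand the two memory-related knobs
> involved, written as an app.config sketch. The watermark default and the
> max_open_files behavior are my reading of the docs rather than anything
> I've verified, so treat the details as assumptions:
>
>     %% app.config excerpt -- illustrative sketch, not our actual file
>     [
>      {os_mon, [
>        %% memsup raises the system_memory_high_watermark alarm when
>        %% allocated memory exceeds this fraction of system RAM;
>        %% 0.80 is, I believe, the OTP default.
>        {system_memory_high_watermark, 0.80}
>      ]},
>      {eleveldb, [
>        {data_root, "/var/data/ctv/riak/leveldb"},
>        %% per-vnode block cache: 8388608 bytes (8 MB) x 512 vnodes = 4 GB
>        {cache_size, 8388608},
>        %% my assumption: each open .sst table also pins some memory per
>        %% vnode, so this is another multiplier to watch (value illustrative)
>        {max_open_files, 20}
>      ]}
>     ].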
>
> Dave
>
> --
> Dave Lowell
> [email protected]
>
> Representative logs, many similar ones deleted for brevity:
>
> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
> 2012-11-07 01:55:50.517 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.18585.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}] [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
> 2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool crashed {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK: Resource temporarily unavailable"}
> 2012-11-07 01:57:27.425 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.19181.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}} in gen_fsm:terminate/7 line
>
> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
>
>> Hello Riak Folks,
>>
>> The last three days, we've been having a string of problems with Riak.
>> An otherwise healthy server running our full application stack will
>> suddenly start throwing a bunch of errors in the logs. Although the Riak
>> processes stay up, most or all requests to Riak fail during these
>> periods.
>>
>> The errors in the logs predominantly describe "riak_kv_vnode worker pool
>> crashed" and timeout conditions (there's a small sketch below of how
>> that timeout term arises). This morning, this crashy behavior started
>> immediately after a clean Riak startup and a single call to our API, so
>> the logs are quite free of other noise. I've summarized those logs below
>> for curious parties, and can attach the full set if needed.
>>
>> I forgot to check this morning, but during a similar outage on Monday,
>> the Riak server was refusing connections from new clients.
>>
>> Interestingly, after giving Riak a while with no traffic at all today
>> (like 15-30 minutes), it appears to have recovered without a restart.
>> We've had similar recoveries during other "outages" of this type since
>> Sunday evening. Facilitating this sort of recovery does seem to require
>> shutting down all application KV requests for a while.
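>>
>> For what it's worth, I think I at least understand the mechanical shape
>> of those timeout errors. Here's a minimal standalone sketch (my code,
>> not Riak's) that reproduces the {timeout,{gen_server,call,...}} exit
>> reason we see, i.e. a caller giving up on a stuck worker after 5000 ms:
>>
>>     %% timeout_demo.erl -- illustrative sketch only, not Riak code
>>     -module(timeout_demo).
>>     -export([run/0]).
>>
>>     run() ->
>>         %% Stand-in for a worker that is wedged (say, mid-fold) and
>>         %% never answers its mailbox.
>>         Stuck = spawn(fun() -> timer:sleep(infinity) end),
>>         %% A synchronous call with a 5000 ms timeout, the same shape
>>         %% as the {checkout,false,5000} pool request in our logs.
>>         try gen_server:call(Stuck, {checkout, false, 5000}, 5000)
>>         catch
>>             exit:{timeout, {gen_server, call, Args}} ->
>>                 %% Exactly the term that shows up in the log records.
>>                 {worker_never_replied, Args}
>>         end.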
>>
>> We're suspicious of some kind of corruption in the eleveldb on-disk
>> files, because in past outages of this type, we've observed that the
>> condition persists across reboots. But we don't have much more evidence
>> than that. Is there a command we can run that will check the eleveldb
>> files for corruption or inconsistency?
>>
>> Other than that, what can cause "worker pool crashed" events? What do
>> you know about the "timeouts" that appear in these logs?
>>
>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, with the
>> eleveldb backend and 512 partitions. We're running predominantly in a
>> single-node configuration on a bunch of isolated dev boxes at the
>> moment, on our way to spreading our 512 vnodes across 5 hosts in
>> production.
>>
>> Thanks for your help,
>>
>> Dave
>>
>> --
>> Dave Lowell
>> [email protected]
>>
>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node '[email protected]'
>>
>> ... normal startup messages ...
>>
>> 2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419 Wait complete for application riak_search (0 seconds)
>> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool crashed {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>
>> ... 13 more "riak_kv_vnode worker pool crashed" messages ...
>>
>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} in gen_fsm:terminate/7 line 611
>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
>>
>> ... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with
>> reason" and "gen_fsm <0.2989.0> in state ready terminated with reason"
>> message pairs ...
>>
>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> terminated with reason: {error,{case_clause,{error,timeout,[[],[],[],(snip: long run of empty lists),[<<"1352256943.4983411">>],[],[],[],...]}},...}
>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process <0.15322.0> with 0 neighbours exited with reason: {error,{case_clause,{error,timeout,[[],[],[],(snip: long run of empty lists),[<<"1352256943.4983411">>],[],[],[],...]}},...} in gen_server:terminate/6 line 747
>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had child undefined started with {riak_api_pb_server,start_link,undefined} at <0.15322.0> exit with reason {error,{case_clause,{error,timeout,[[],[],[],(snip: long run of empty lists),[<<"1352256943.4983411">>],[],[],[],...]}},...} in context child_terminated
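P.S. One hedged guess on the LOCK error in the first batch of logs:
"Resource temporarily unavailable" when opening a leveldb LOCK file
generally means the OS-level file lock is still held -- often by a stray
beam process that never fully exited, though in a crash loop like this it
may be the node racing itself while a vnode restarts. A quick check you can
run from `riak attach` (my sketch; it assumes lsof is installed, and the
path is copied from your log record):

    %% Ask the OS who, if anyone, is holding the partition's LOCK file.
    io:format("~s", [os:cmd(
        "lsof /var/data/ctv/riak/leveldb/"
        "1258832464966656615093408225054454710289582522368/LOCK")]).

If that prints another process during an episode, we have our answer.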
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
