Re: mysterious Riak problems

Matthew Von-Maszewski Wed, 14 Nov 2012 08:51:30 -0800

Dave,

Just getting my head back into the game.  Was away for a few days.  Random 
thought, maybe there is a hard drive with a read problem.  That can cause 
issues similar to this.  1.2.1 does NOT percolate the read errors seen in 
leveldb to riak-admin (yes, that should start to happen in 1.3).


I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for this 
script.  Please substitute your appropriate path from app.config and try these 
two commands on each physical server (node):

sort /var/db/riak/leveldb/*/LOG* | grep corrupted
sort /var/db/riak/leveldb/*/LOG* | grep checksum

If you get hits on either, we have found the problem.

The "LOCK file unavailable" is more of a statement about the internal condition 
of the code instead of an error.  The message is saying that the first attempt 
to re-open a vnode failed because the prior instance is still closing (or more 
likely waiting for Erlang's garbage collection to finish destroying things).  
This message is new to 1.2 code base.

Matthew




On Nov 7, 2012, at 6:56 PM, David Lowell wrote:

> After further thought, I want to add more color to this issue. I hypothesize 
> the symptoms I described here were continued fallout of an earlier crash. So 
> I've waded further back into to the logs to try to shed light on how the Riak 
> process was doing prior to this time. It was unhappy. It appears to have 
> failed and stopped several times in the 1/2 day prior. The pattern revealed 
> in the logs looks something like this:
> 
> - Following  a clean startup, the service runs for a while
> 
> - Usually the first log entry of any kind near the beginning of a long block 
> of error logs is "alarm_handler: {set,{system_memory_high_watermark,[]}}"  
>       ( Is this indicating excessive memory use? )
> 
> - Within a few minutes we see several log messages warning about "long_gc", 
> which I assume is an indication that garbage collection took longer than some 
> threshold
> 
> - Within the next minute or two, we start to see the legions of errors, 
> "riak_kv_vnode worker pool crashed", and "gen_fsm" having some sort of 
> timeout when trying to communicate with the eleveldb backend
> 
> - Eventually we see a log record indicating "Failed to start 
> riak_kv_eleveldb" because of a leveldb LOCK file being temporarily unavailable
> 
> - Then Riak starts to exit:  riak:stop:46 "backend module failed to start." 
> 
> 
> So, not really knowing the Riak internals, it's a little difficult to piece 
> together the story here. Could be we're running low on memory. Hard to know 
> why riak_kv_workers are failing, or why this leveldb LOCK file is 
> unavailable. To those more learned, do these log records tell a story?
> 
> For the record, we're using the default 8 MB of leveldb cache per vnode, so 
> that ought to cap cache for our 512 vnodes at 4 GB. Our host has 32 GB of 
> physical memory. Are there other pieces of Riak that can use a lot of memory 
> that we need to look out for?
> 
> I'll include a few of the actual log records for reference, below.
> 
> Dave
> 
> --
> Dave Lowell
> [email protected]
> 
> Representative logs, many similar ones deleted for brevity:
> 
> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: 
> {set,{system_memory_high_watermark,[]}}
> 2012-11-07 01:55:50.517 [info] 
> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
> <0.18585.32> 
> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}]
>  
> [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
> 2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510 
> 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool 
> crashed 
> {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state 
> ready terminated with reason: 
> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed to 
> start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock 
> /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK:
>  Resource temporarily unavailable"}
> 2012-11-07 01:57:27.425 [info] 
> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
> <0.19181.32> 
> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
>  
> [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0> 
> with 0 neighbours exited with reason: 
> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
>  in gen_fsm:terminate/7 line 
> 
> 
> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
> 
>> Hello Riak Folks,
>> 
>> The last three days, we've been having a string of problems with Riak. An 
>> otherwise healthy server running our full application stack will suddenly 
>> start throwing a bunch of errors in the logs. Although the Riak processes 
>> stay up, most or all requests to Riak fail during these periods.
>> 
>> The errors in the logs are predominantly describing "riak_kv_vnode worker 
>> pool crashed" and timeout conditions. This morning, we had this crashy 
>> behavior start immediately after a clean Riak startup, and making a single 
>> call to our API, so the logs are quite free of other noise. I've summarized 
>> those logs below for curious parties, and can attach the full set of logs if 
>> needed.
>> 
>> I forgot to check this morning, but during a similar outage on Monday, the 
>> Riak server was refusing connections to new clients.
>> 
>> Interestingly, after giving Riak a while with no traffic at all today (like 
>> 15-30 minutes), it appears to have recovered without a restart. We've had 
>> similar recoveries during other "outages" of this type since Sunday evening. 
>> Facilitating this sort of recovery does seem to require shutting down all 
>> application KV requests for a while.
>> 
>> We're suspicious of some kind of corruption in the eleveldb on-disk files, 
>> because in past outages of this type, we've observed that the condition 
>> persists over reboots. But we don't have much more evidence than that. Is 
>> there a command we can run that will check over eleveldb files for 
>> corruption or inconsistency?
>> 
>> Other than that, what can cause "worker pool crashed" events? What do you 
>> know about the "timeouts" that are in these logs?
>> 
>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04,  eleveldb backend 
>> with 512 partitions. We're running predominantly in a single-node 
>> configuration on a bunch of isolated dev boxes at the moment, on our way to 
>> spreading out our 512 vnodes onto 5 hosts in production.
>> 
>> Thanks for your help,
>> 
>> Dave
>> 
>> --
>> Dave Lowell
>> [email protected]
>> 
>> 
>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node 
>> '[email protected]'
>> 
>> ... normal startup messages ...
>> 
>> 2012-11-07 18:11:50.109 [info] 
>> <0.10582.0>@riak_core:wait_for_application:419 Wait complete for application 
>> riak_search (0 seconds)
>> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 
>> 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool 
>> crashed 
>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready 
>> terminated with reason: 
>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>> 
>> ... 13 more "riak_kv_vnode worker pool crashed" messages...
>> 
>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> 
>> with 0 neighbours exited with reason: 
>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>  in gen_fsm:terminate/7 line 611
>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready 
>> terminated with reason: 
>> {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
>> 
>> ... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with reason" 
>> and "gen_fsm <0.2989.0> in state ready terminated with reason" message pairs
>> 
>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> 
>> terminated with reason: 
>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process <0.15322.0> 
>> with 0 neighbours exited with reason: 
>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>  in gen_server:terminate/6 line 747
>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had 
>> child undefined started with {riak_api_pb_server,start_link,undefined} at 
>> <0.15322.0> exit with reason 
>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>  in context child_terminated
> 
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: mysterious Riak problems

Reply via email to