Thanks Matthew. I've run both greps with no hits, unfortunately.

A couple of details that I want to highlight. Since I first posted about this 
issue, we upgraded from Riak 1.2.0 to 1.2.1. Following that upgrade, we 
continue to see these periods of instability with errors in the logs like 
"riak_kv_vnode worker pool crashed", but we have also started seeing lots of new 
"Unrecognized message" error records. In both cases, we see tons of these 
"long_gc" monitoring messages and several "system_memory_high_watermark" alarms 
during these periods. The client also has connection problems such as timeouts 
and refused connections.

The logs from the last 18 hours on this server are really a mess, with very 
high levels of all of these errors. I'd be happy to send them along if you 
think that would help.
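
For what it's worth, here's roughly how I've been tallying those errors (just a 
sketch; /var/log/riak/console.log is the default console log location, so 
substitute our actual path):

# Count each error type we're seeing (sketch only; adjust the log path)
for p in "worker pool crashed" "Unrecognized message" "long_gc" "system_memory_high_watermark"; do
    printf '%-35s %s\n' "$p" "$(grep -c "$p" /var/log/riak/console.log)"
done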

It's also worth noting that our application is pretty write-heavy and has lots 
of parallel processes generating those writes (so not a lot of flow control if 
Riak bogs down, at least not yet). It's probably pushing Riak fairly hard.

Dave

--
Dave Lowell
[email protected]

On Nov 14, 2012, at 8:51 AM, Matthew Von-Maszewski wrote:

> Dave,
> 
> Just getting my head back into the game.  Was away for a few days.  Random 
> thought: maybe there is a hard drive with a read problem.  That can cause 
> issues similar to this.  1.2.1 does NOT percolate the read errors seen in 
> leveldb to riak-admin (yes, that should start to happen in 1.3).
> 
> I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for this 
> script.  Please substitute your appropriate path from app.config and try 
> these two commands on each physical server (node):
> 
> sort /var/db/riak/leveldb/*/LOG* | grep corrupted
> sort /var/db/riak/leveldb/*/LOG* | grep checksum
> 
> If you get hits on either, we have found the problem.
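> 
> If either of those does hit, a per-vnode breakdown can help point at the bad 
> drive.  Rough sketch, same data_root assumption as above:
> 
> for d in /var/db/riak/leveldb/*/; do
>     n=$(cat "$d"LOG* 2>/dev/null | grep -cE 'corrupted|checksum')
>     [ "$n" -gt 0 ] && echo "$n $d"
> done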
> 
> The "LOCK file unavailable" is more of a statement about the internal 
> condition of the code instead of an error.  The message is saying that the 
> first attempt to re-open a vnode failed because the prior instance is still 
> closing (or more likely waiting for Erlang's garbage collection to finish 
> destroying things).  This message is new to 1.2 code base.
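> 
> If you want to confirm that it really is just the prior instance still holding 
> the file, lsof against the LOCK path from the error message should show which 
> beam process (if any) still has it open.  Sketch only; substitute the partition 
> directory from your own log:
> 
> lsof /var/db/riak/leveldb/<partition>/LOCK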
> 
> Matthew
> 
> 
> 
> 
> On Nov 7, 2012, at 6:56 PM, David Lowell wrote:
> 
>> After further thought, I want to add more color to this issue. I hypothesize 
>> that the symptoms I described here were continued fallout of an earlier crash. 
>> So I've waded further back into the logs to try to shed light on how the Riak 
>> process was doing prior to this time. It was unhappy. It appears to have failed 
>> and stopped several times in the half day prior. The pattern revealed in the 
>> logs looks something like this (a grep sketch for pulling these events out of 
>> the logs follows the list):
>> 
>> - Following a clean startup, the service runs for a while
>> 
>> - Usually the first log entry of any kind near the beginning of a long block 
>> of error logs is "alarm_handler: {set,{system_memory_high_watermark,[]}}" 
>> (Is this indicating excessive memory use?)
>> 
>> - Within a few minutes we see several log messages warning about "long_gc", 
>> which I assume is an indication that garbage collection took longer than 
>> some threshold
>> 
>> - Within the next minute or two, we start to see the legions of errors, 
>> "riak_kv_vnode worker pool crashed", and "gen_fsm" having some sort of 
>> timeout when trying to communicate with the eleveldb backend
>> 
>> - Eventually we see a log record indicating "Failed to start 
>> riak_kv_eleveldb" because of a leveldb LOCK file being temporarily 
>> unavailable
>> 
>> - Then Riak starts to exit:  riak:stop:46 "backend module failed to start." 
>> 
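>> For reference, this is roughly how I'm pulling that sequence of events out of 
>> the logs, in time order (sketch only; our console log lives under 
>> /var/log/riak, so adjust the path as needed):
>> 
>> grep -E 'system_memory_high_watermark|long_gc|worker pool crashed|Failed to start riak_kv_eleveldb|backend module failed to start' \
>>     /var/log/riak/console.log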
>> 
>> So, not really knowing the Riak internals, I'm finding it a little difficult 
>> to piece together the story here. Could be we're running low on memory. Hard 
>> to know why riak_kv_workers are failing, or why this leveldb LOCK file is 
>> unavailable. To those more learned, do these log records tell a story?
>> 
>> For the record, we're using the default 8 MB of leveldb cache per vnode, so 
>> that ought to cap cache for our 512 vnodes at 4 GB. Our host has 32 GB of 
>> physical memory. Are there other pieces of Riak that can use a lot of memory 
>> that we need to look out for?
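>> 
>> If it would help, I can capture the VM's memory breakdown the next time this 
>> happens. This is roughly what I'd grab (a sketch, assuming riak-admin is on 
>> the PATH and the node is responsive enough to answer):
>> 
>> riak-admin status | grep '^memory'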
>> 
>> I'll include a few of the actual log records for reference, below.
>> 
>> Dave
>> 
>> --
>> Dave Lowell
>> [email protected]
>> 
>> Representative logs, many similar ones deleted for brevity:
>> 
>> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: 
>> {set,{system_memory_high_watermark,[]}}
>> 2012-11-07 01:55:50.517 [info] 
>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
>> <0.18585.32> 
>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}]
>>  
>> [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
>> 2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510 
>> 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool 
>> crashed 
>> {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
>> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state 
>> ready terminated with reason: 
>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
>> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed 
>> to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock 
>> /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK:
>>  Resource temporarily unavailable"}
>> 2012-11-07 01:57:27.425 [info] 
>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc 
>> <0.19181.32> 
>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
>>  
>> [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
>> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0> 
>> with 0 neighbours exited with reason: 
>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
>>  in gen_fsm:terminate/7 line 
>> 
>> 
>> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
>> 
>>> Hello Riak Folks,
>>> 
>>> The last three days, we've been having a string of problems with Riak. An 
>>> otherwise healthy server running our full application stack will suddenly 
>>> start throwing a bunch of errors in the logs. Although the Riak processes 
>>> stay up, most or all requests to Riak fail during these periods.
>>> 
>>> The errors in the logs are predominantly describing "riak_kv_vnode worker 
>>> pool crashed" and timeout conditions. This morning, this crashy behavior 
>>> started immediately after a clean Riak startup and a single call to our API, 
>>> so the logs are quite free of other noise. I've summarized 
>>> those logs below for curious parties, and can attach the full set of logs 
>>> if needed.
>>> 
>>> I forgot to check this morning, but during a similar outage on Monday, the 
>>> Riak server was refusing connections to new clients.
>>> 
>>> Interestingly, after giving Riak a while with no traffic at all today (like 
>>> 15-30 minutes), it appears to have recovered without a restart. We've had 
>>> similar recoveries during other "outages" of this type since Sunday 
>>> evening. Facilitating this sort of recovery does seem to require shutting 
>>> down all application KV requests for a while.
>>> 
>>> We're suspicious of some kind of corruption in the eleveldb on-disk files, 
>>> because in past outages of this type, we've observed that the condition 
>>> persists over reboots. But we don't have much more evidence than that. Is 
>>> there a command we can run that will check over eleveldb files for 
>>> corruption or inconsistency?
>>> 
>>> Other than that, what can cause "worker pool crashed" events? What do you 
>>> know about the "timeouts" that are in these logs?
>>> 
>>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04 with the eleveldb 
>>> backend and 512 partitions. We're running predominantly in a single-node 
>>> configuration on a bunch of isolated dev boxes at the moment, on our way to 
>>> spreading our 512 vnodes across 5 hosts in production.
>>> 
>>> Thanks for your help,
>>> 
>>> Dave
>>> 
>>> --
>>> Dave Lowell
>>> [email protected]
>>> 
>>> 
>>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node 
>>> '[email protected]'
>>> 
>>> ... normal startup messages ...
>>> 
>>> 2012-11-07 18:11:50.109 [info] 
>>> <0.10582.0>@riak_core:wait_for_application:419 Wait complete for 
>>> application riak_search (0 seconds)
>>> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 
>>> 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool 
>>> crashed 
>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state 
>>> ready terminated with reason: 
>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>> 
>>> ... 13 more "riak_kv_vnode worker pool crashed" messages...
>>> 
>>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> 
>>> with 0 neighbours exited with reason: 
>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>>  in gen_fsm:terminate/7 line 611
>>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state 
>>> ready terminated with reason: 
>>> {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
>>> 
>>> ... 13 more "CRASH REPORT Process <X> with 0 neighbours exited with reason" 
>>> and "gen_fsm <0.2989.0> in state ready terminated with reason" message pairs
>>> 
>>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> 
>>> terminated with reason: 
>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process 
>>> <0.15322.0> with 0 neighbours exited with reason: 
>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>>  in gen_server:terminate/6 line 747
>>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup had 
>>> child undefined started with {riak_api_pb_server,start_link,undefined} at 
>>> <0.15322.0> exit with reason 
>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...}
>>>  in context child_terminated
>> 
> 

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
