Dave, The problem seems most pronounced when you are averaging 6 megabyte values. Honestly, my previous test suite only included 150k values. The WriteThrottle is NOT giving you the support you need in this situation (you need it to be throttling a tad more). I need to think on how to help here.
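In the meantime, a quick way to gauge how often the write path is actually stalling (a sketch, assuming the same data_root as earlier in this thread) is to count the waiting entries instead of just spotting them:

sort /var/db/riak/leveldb/*/LOG* | grep -ci waiting
grep -ic waiting /var/db/riak/leveldb/*/LOG*

The first command gives a server-wide count; the second breaks it down per vnode file. A handful of hits under heavy load is expected, but counts that climb steadily would point at the throttle being overrun.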
I see from the old emails that you have 32G of physical memory and this log suggests you have 512 vnodes per physical server. Is this correct? Also, how many physical servers? Matthew On Nov 14, 2012, at 2:00 PM, David Lowell wrote: > Thanks Matthew. Yep, there are quite a few hits on 'waiting'. Interesting. > I'll send the merged log separately. > > Dave > > -- > Dave Lowell > [email protected] > > On Nov 14, 2012, at 10:43 AM, Matthew Von-Maszewski wrote: > >> Dave, >> >> Ok, heavy writes. Let's see if leveldb has hit one of its intentional >> "stalls": >> >>> sort /var/db/riak/leveldb/*/LOG* | grep -i waiting >> >> See if that shows any indication of a stall in the LOG files of leveldb. If >> so, pick one server and send me a combined LOG file from that server: >> >> sort /var/db/riak/leveldb/*/LOG* >LOG.all >> >> That will tar zip really well. >> >> Matthew >> >> >> On Nov 14, 2012, at 1:34 PM, David Lowell wrote: >> >>> Thanks Matthew. I've run both greps with no hits, unfortunately. >>> >>> A couple of details that I want to highlight. Since I first posted about >>> this issue, we upgraded from Riak 1.2.0 to 1.2.1. Following that upgrade, >>> we continue to see these periods of instability with errors in the logs like >>> "riak_kv_vnode worker pool crashed", but we started seeing lots of new >>> error records in the logs about "Unrecognized message" as well. In both >>> cases, we see tons of these "long_gc" monitoring messages and several >>> "system_memory_high_watermark" alarms during these periods. The client also >>> has connection problems such as timeouts and connections being refused. >>> >>> The logs from the last 18 hours on this server are really a mess, with very >>> high levels of all of these errors. I'd be happy to send them along if you >>> think that would help. >>> >>> It's also worth noting that our application is pretty write heavy, with lots >>> of parallel processes generating those writes (so not a lot of flow control >>> if Riak bogs down, at least not yet). It's probably pushing Riak fairly >>> hard. >>> >>> Dave >>> >>> -- >>> Dave Lowell >>> [email protected] >>> >>> On Nov 14, 2012, at 8:51 AM, Matthew Von-Maszewski wrote: >>> >>>> Dave, >>>> >>>> Just getting my head back into the game. Was away for a few days. Random >>>> thought, maybe there is a hard drive with a read problem. That can cause >>>> issues similar to this. 1.2.1 does NOT percolate the read errors seen in >>>> leveldb to riak-admin (yes, that should start to happen in 1.3). >>>> >>>> I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for this >>>> script. Please substitute your appropriate path from app.config and try >>>> these two commands on each physical server (node): >>>> >>>> sort /var/db/riak/leveldb/*/LOG* | grep corrupted >>>> sort /var/db/riak/leveldb/*/LOG* | grep checksum >>>> >>>> If you get hits on either, we have found the problem. >>>> >>>> The "LOCK file unavailable" is more of a statement about the internal >>>> condition of the code than an error. The message is saying that the >>>> first attempt to re-open a vnode failed because the prior instance is >>>> still closing (or more likely waiting for Erlang's garbage collection to >>>> finish destroying things). This message is new to the 1.2 code base. >>>> >>>> Matthew >>>> >>>> >>>> >>>> >>>> On Nov 7, 2012, at 6:56 PM, David Lowell wrote: >>>> >>>>> After further thought, I want to add more color to this issue. I >>>>> hypothesize the symptoms I described here were continued fallout of an >>>>> earlier crash.
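>>>>> (A note on method, since everything below is reconstructed from logs: I merged and time-sorted the node logs with something along these lines; the path assumes the stock Riak log location, and the patterns are only illustrative:
>>>>> 
>>>>> grep -h -e long_gc -e crashed -e watermark /var/log/riak/console.log* | sort
>>>>> 
>>>>> Each entry begins with a timestamp, so a plain sort yields a chronological merge.)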
So I've waded further back into the logs to try to shed >>>>> light on how the Riak process was doing prior to this time. It was >>>>> unhappy. It appears to have failed and stopped several times in the half-day >>>>> prior. The pattern revealed in the logs looks something like this: >>>>> >>>>> - Following a clean startup, the service runs for a while >>>>> >>>>> - Usually the first log entry of any kind near the beginning of a long >>>>> block of error logs is "alarm_handler: >>>>> {set,{system_memory_high_watermark,[]}}" >>>>> (Is this indicating excessive memory use?) >>>>> >>>>> - Within a few minutes we see several log messages warning about >>>>> "long_gc", which I assume is an indication that garbage collection took >>>>> longer than some threshold >>>>> >>>>> - Within the next minute or two, we start to see the legions of errors, >>>>> "riak_kv_vnode worker pool crashed", and "gen_fsm" having some sort of >>>>> timeout when trying to communicate with the eleveldb backend >>>>> >>>>> - Eventually we see a log record indicating "Failed to start >>>>> riak_kv_eleveldb" because of a leveldb LOCK file being temporarily >>>>> unavailable >>>>> >>>>> - Then Riak starts to exit: riak:stop:46 "backend module failed to >>>>> start." >>>>> >>>>> >>>>> So, not really knowing the Riak internals, it's a little difficult to >>>>> piece together the story here. Could be we're running low on memory. Hard >>>>> to know why riak_kv_workers are failing, or why this leveldb LOCK file is >>>>> unavailable. To those more learned, do these log records tell a story? >>>>> >>>>> For the record, we're using the default 8 MB of leveldb cache per vnode, >>>>> so that ought to cap cache for our 512 vnodes at 4 GB. Our host has 32 GB >>>>> of physical memory. Are there other pieces of Riak that can use a lot of >>>>> memory that we need to look out for? >>>>> >>>>> I'll include a few of the actual log records for reference, below.
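>>>>> (Back-of-the-envelope on the memory question, treating cache_size as the only per-vnode cost, which is probably optimistic:
>>>>> 
>>>>> 512 vnodes x 8 MB cache_size = 4096 MB of block cache
>>>>> + per-vnode write buffers and open-file overhead
>>>>> + the Erlang VM's own process heaps
>>>>> 
>>>>> so the 4 GB figure is a floor, not a ceiling.)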
>>>>> >>>>> Dave >>>>> >>>>> -- >>>>> Dave Lowell >>>>> [email protected] >>>>> >>>>> Representative logs, many similar ones deleted for brevity: >>>>> >>>>> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: >>>>> {set,{system_memory_high_watermark,[]}} >>>>> 2012-11-07 01:55:50.517 [info] >>>>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc >>>>> <0.18585.32> >>>>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}] >>>>> >>>>> [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}] >>>>> 2012-11-07 01:56:20.303 [error] >>>>> <0.9231.0>@riak_core_vnode:handle_info:510 >>>>> 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker >>>>> pool crashed >>>>> {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}} >>>>> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state >>>>> ready terminated with reason: >>>>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}} >>>>> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 >>>>> Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock >>>>> /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK: >>>>> Resource temporarily unavailable"} >>>>> 2012-11-07 01:57:27.425 [info] >>>>> <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc >>>>> <0.19181.32> >>>>> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] >>>>> >>>>> [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}] >>>>> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process >>>>> <0.10002.0> with 0 neighbours exited with reason: >>>>> {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}} >>>>> in gen_fsm:terminate/7 line >>>>> >>>>> >>>>> On Nov 7, 2012, at 11:34 AM, David Lowell wrote: >>>>> >>>>>> Hello Riak Folks, >>>>>> >>>>>> The last three days, we've been having a string of problems with Riak. >>>>>> An otherwise healthy server running our full application stack will >>>>>> suddenly start throwing a bunch of errors in the logs. Although the Riak >>>>>> processes stay up, most or all requests to Riak fail during these >>>>>> periods. >>>>>> >>>>>> The errors in the logs predominantly describe "riak_kv_vnode >>>>>> worker pool crashed" and timeout conditions. This morning, we had this >>>>>> crashy behavior start immediately after a clean Riak startup and >>>>>> a single call to our API, so the logs are quite free of other noise. >>>>>> I've summarized those logs below for curious parties, and can attach the >>>>>> full set of logs if needed. >>>>>> >>>>>> I forgot to check this morning, but during a similar outage on Monday, >>>>>> the Riak server was refusing connections to new clients.
>>>>>> >>>>>> Interestingly, after giving Riak a while with no traffic at all today >>>>>> (like 15-30 minutes), it appears to have recovered without a restart. >>>>>> We've had similar recoveries during other "outages" of this type since >>>>>> Sunday evening. Facilitating this sort of recovery does seem to require >>>>>> shutting down all application KV requests for a while. >>>>>> >>>>>> We're suspicious of some kind of corruption in the eleveldb on-disk >>>>>> files, because in past outages of this type, we've observed that the >>>>>> condition persists over reboots. But we don't have much more evidence >>>>>> than that. Is there a command we can run that will check over eleveldb >>>>>> files for corruption or inconsistency? >>>>>> >>>>>> Other than that, what can cause "worker pool crashed" events? What do >>>>>> you know about the "timeouts" that are in these logs? >>>>>> >>>>>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, eleveldb >>>>>> backend with 512 partitions. We're running predominantly in a >>>>>> single-node configuration on a bunch of isolated dev boxes at the >>>>>> moment, on our way to spreading out our 512 vnodes onto 5 hosts in >>>>>> production. >>>>>> >>>>>> Thanks for your help, >>>>>> >>>>>> Dave >>>>>> >>>>>> -- >>>>>> Dave Lowell >>>>>> [email protected] >>>>>> >>>>>> >>>>>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node >>>>>> '[email protected]' >>>>>> >>>>>> ... normal startup messages ... >>>>>> >>>>>> 2012-11-07 18:11:50.109 [info] >>>>>> <0.10582.0>@riak_core:wait_for_application:419 Wait complete for >>>>>> application riak_search (0 seconds) >>>>>> 2012-11-07 18:22:18.509 [error] >>>>>> <0.2897.0>@riak_core_vnode:handle_info:510 >>>>>> 105616329260241031198313161739262640092323250176 riak_kv_vnode worker >>>>>> pool crashed >>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} >>>>>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state >>>>>> ready terminated with reason: >>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} >>>>>> >>>>>> ... 13 more "riak_kv_vnode worker pool crashed" messages... >>>>>> >>>>>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process >>>>>> <0.2899.0> with 0 neighbours exited with reason: >>>>>> {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} >>>>>> in gen_fsm:terminate/7 line 611 >>>>>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state >>>>>> ready terminated with reason: >>>>>> {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}} >>>>>> >>>>>> ... 
13 more "CRASH REPORT Process <X> with 0 neighbours exited with >>>>>> reason" and "gen_fsm <0.2989.0> in state ready terminated with reason" >>>>>> message pairs >>>>>> >>>>>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> >>>>>> terminated with reason: >>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process >>>>>> <0.15322.0> with 0 neighbours exited with reason: >>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>> in gen_server:terminate/6 line 747 >>>>>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup >>>>>> had child undefined started with >>>>>> {riak_api_pb_server,start_link,undefined} at <0.15322.0> exit with >>>>>> reason >>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>> in context child_terminated >>>>> >>>>> _______________________________________________ >>>>> riak-users mailing list >>>>> [email protected] >>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >>>> >>> >> >
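P.S. Circling back to the memory question at the top: if the counts confirm you are tight on RAM, the per-vnode leveldb settings in app.config are the first knobs to check. A sketch against a stock eleveldb section (the figures are illustrative, not a tuning recommendation):

{eleveldb, [
            {data_root, "/var/db/riak/leveldb"},
            %% cache_size is per vnode: at the 8 MB default, 512 vnodes
            %% consume 4 GB of block cache before write buffers and the
            %% Erlang VM's heaps are counted.
            {cache_size, 4194304},   %% e.g. 4 MB per vnode
            {max_open_files, 50}     %% also paid per vnode
           ]}

Halving cache_size on a 512-vnode node frees roughly 2 GB, at the cost of more disk reads.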
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
