The 512 vnodes will run on 5 physical nodes in production, but we're running all 512 on a single node in dev. And it's on one of these single node "clusters" that we're seeing these issues.
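For reference, that 512-partition ring is fixed by riak_core's ring_creation_size when the cluster is first created. A minimal sketch of the app.config setting involved, assuming everything else is left at its defaults:

    %% Illustrative riak_core section of app.config (rest of the file
    %% omitted). The ring size cannot be changed after the cluster is
    %% created, so it is the same 512 whether the vnodes sit on one dev
    %% box or on 5 production nodes.
    {riak_core, [
        {ring_creation_size, 512}
    ]}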
Dave

--
Dave Lowell
[email protected]

On Nov 14, 2012, at 1:35 PM, Matthew Von-Maszewski wrote:

> Dave,
>
> The problem seems most pronounced when you are averaging 6-megabyte values.
> Honestly, my previous test suite only included 150k values. The
> WriteThrottle is NOT giving you the support you need in this situation (you
> need it to be throttling a tad more). I need to think on how to help here.
>
> I see from the old emails that you have 32G of physical memory, and this log
> suggests you have 512 vnodes per physical server. Is this correct? Also,
> how many physical servers?
>
> Matthew
>
> On Nov 14, 2012, at 2:00 PM, David Lowell wrote:
>
>> Thanks Matthew. Yep, there are quite a few hits on 'waiting'. Interesting.
>> I'll send the merged log separately.
>>
>> Dave
>>
>> --
>> Dave Lowell
>> [email protected]
>>
>> On Nov 14, 2012, at 10:43 AM, Matthew Von-Maszewski wrote:
>>
>>> Dave,
>>>
>>> Ok, heavy writes. Let's see if leveldb has hit one of its intentional
>>> "stalls":
>>>
>>>     sort /var/db/riak/leveldb/*/LOG* | grep -i waiting
>>>
>>> See if that shows any indication of a stall in the LOG files of leveldb.
>>> If so, pick one server and send me a combined LOG file from that server:
>>>
>>>     sort /var/db/riak/leveldb/*/LOG* > LOG.all
>>>
>>> That will tar/gzip really well.
>>>
>>> Matthew
>>>
>>> On Nov 14, 2012, at 1:34 PM, David Lowell wrote:
>>>
>>>> Thanks Matthew. I've run both greps with no hits, unfortunately.
>>>>
>>>> A couple of details that I want to highlight. Since I first posted about
>>>> this issue, we upgraded from Riak 1.2.0 to 1.2.1. Following that upgrade,
>>>> we continue to see these periods of instability with errors in the logs
>>>> like "riak_kv_vnode worker pool crashed", but we started seeing lots of
>>>> new error records in the logs about "Unrecognized message" as well. In
>>>> both cases, we see tons of these "long_gc" monitoring messages and
>>>> several "system_memory_high_watermark" alarms during these periods. The
>>>> client also has connection problems such as timeouts and connections
>>>> being refused.
>>>>
>>>> The logs from the last 18 hours on this server are really a mess, with
>>>> very high levels of all of these errors. I'd be happy to send them along
>>>> if you think that would help.
>>>>
>>>> It's also worth noting that our application is pretty write-heavy and has
>>>> lots of parallel processes generating those writes (so not a lot of flow
>>>> control if Riak bogs down, at least not yet). It's probably pushing Riak
>>>> fairly hard.
>>>>
>>>> Dave
>>>>
>>>> --
>>>> Dave Lowell
>>>> [email protected]
>>>>
>>>> On Nov 14, 2012, at 8:51 AM, Matthew Von-Maszewski wrote:
>>>>
>>>>> Dave,
>>>>>
>>>>> Just getting my head back into the game. Was away for a few days.
>>>>> Random thought: maybe there is a hard drive with a read problem. That
>>>>> can cause issues similar to this. 1.2.1 does NOT percolate the read
>>>>> errors seen in leveldb to riak-admin (yes, that should start to happen
>>>>> in 1.3).
>>>>>
>>>>> I will assume your leveldb "data_root" is "/var/db/riak/leveldb" for
>>>>> this script. Please substitute your appropriate path from app.config
>>>>> and try these two commands on each physical server (node):
>>>>>
>>>>>     sort /var/db/riak/leveldb/*/LOG* | grep corrupted
>>>>>     sort /var/db/riak/leveldb/*/LOG* | grep checksum
>>>>>
>>>>> If you get hits on either, we have found the problem.
>>>>>
>>>>> The "LOCK file unavailable" is more of a statement about the internal
>>>>> condition of the code than an error. The message is saying that the
>>>>> first attempt to re-open a vnode failed because the prior instance is
>>>>> still closing (or, more likely, waiting for Erlang's garbage collection
>>>>> to finish destroying things). This message is new to the 1.2 code base.
>>>>>
>>>>> Matthew
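Taken together, the leveldb checks suggested above amount to a short shell pass on each node. A sketch, assuming Matthew's example data_root of /var/db/riak/leveldb and standard grep/tar (substitute the data_root from your own app.config):

    # Look for on-disk corruption, checksum failures, and intentional write
    # stalls across every vnode's leveldb LOG (run on each physical node).
    sort /var/db/riak/leveldb/*/LOG* | grep corrupted
    sort /var/db/riak/leveldb/*/LOG* | grep checksum
    sort /var/db/riak/leveldb/*/LOG* | grep -ci waiting   # count of stall hits

    # If there are stall hits, merge one node's LOG files and compress the
    # result for sharing; the merged file compresses very well.
    sort /var/db/riak/leveldb/*/LOG* > LOG.all
    tar czf LOG.all.tgz LOG.all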
>>>>> On Nov 7, 2012, at 6:56 PM, David Lowell wrote:
>>>>>
>>>>>> After further thought, I want to add more color to this issue. I
>>>>>> hypothesize that the symptoms I described here were continued fallout
>>>>>> of an earlier crash. So I've waded further back into the logs to try
>>>>>> to shed light on how the Riak process was doing prior to this time. It
>>>>>> was unhappy. It appears to have failed and stopped several times in
>>>>>> the half day prior. The pattern revealed in the logs looks something
>>>>>> like this:
>>>>>>
>>>>>> - Following a clean startup, the service runs for a while
>>>>>>
>>>>>> - Usually the first log entry of any kind near the beginning of a long
>>>>>>   block of error logs is "alarm_handler:
>>>>>>   {set,{system_memory_high_watermark,[]}}"
>>>>>>   (Is this indicating excessive memory use?)
>>>>>>
>>>>>> - Within a few minutes we see several log messages warning about
>>>>>>   "long_gc", which I assume is an indication that garbage collection
>>>>>>   took longer than some threshold
>>>>>>
>>>>>> - Within the next minute or two, we start to see the legions of
>>>>>>   errors: "riak_kv_vnode worker pool crashed", and "gen_fsm" having
>>>>>>   some sort of timeout when trying to communicate with the eleveldb
>>>>>>   backend
>>>>>>
>>>>>> - Eventually we see a log record indicating "Failed to start
>>>>>>   riak_kv_eleveldb" because of a leveldb LOCK file being temporarily
>>>>>>   unavailable
>>>>>>
>>>>>> - Then Riak starts to exit: riak:stop:46 "backend module failed to
>>>>>>   start."
>>>>>>
>>>>>> So, not really knowing the Riak internals, it's a little difficult to
>>>>>> piece together the story here. Could be we're running low on memory.
>>>>>> Hard to know why riak_kv_workers are failing, or why this leveldb LOCK
>>>>>> file is unavailable. To those more learned, do these log records tell
>>>>>> a story?
>>>>>>
>>>>>> For the record, we're using the default 8 MB of leveldb cache per
>>>>>> vnode, so that ought to cap cache for our 512 vnodes at 4 GB. Our host
>>>>>> has 32 GB of physical memory. Are there other pieces of Riak that can
>>>>>> use a lot of memory that we need to look out for?
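For reference, that 8 MB figure corresponds to eleveldb's cache_size in app.config. A minimal sketch of the relevant section, using Matthew's example data_root and the default cache discussed here (illustrative only; Dave's own logs suggest a data_root of /var/data/ctv/riak/leveldb):

    %% Illustrative eleveldb section of app.config, not a recommendation.
    %% 8388608 bytes is the 8 MB default block cache per vnode discussed
    %% above: 512 vnodes x 8 MB is roughly 4 GB of block cache before any
    %% other per-vnode overhead.
    {eleveldb, [
        {data_root, "/var/db/riak/leveldb"},
        {cache_size, 8388608}
    ]}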
>>>>>> I'll include a few of the actual log records for reference, below.
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> --
>>>>>> Dave Lowell
>>>>>> [email protected]
>>>>>>
>>>>>> Representative logs, many similar ones deleted for brevity:
>>>>>>
>>>>>> 2012-11-07 01:41:02.395 [info] <0.51.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
>>>>>> 2012-11-07 01:55:50.517 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.18585.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{riak_core_coverage_plan,'-next_vnode/2-fun-0-',2}},{message_queue_len,0}] [{timeout,219},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,48},{old_heap_size,0},{heap_size,11430}]
>>>>>> 2012-11-07 01:56:20.303 [error] <0.9231.0>@riak_core_vnode:handle_info:510 1258832464966656615093408225054454710289582522368 riak_kv_vnode worker pool crashed {timeout,{gen_fsm,sync_send_event,[<0.9234.0>,{checkout,false,5000},5000]}}
>>>>>> 2012-11-07 01:56:22.382 [error] <0.10002.0> gen_fsm <0.10002.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}}
>>>>>> 2012-11-07 01:57:11.833 [error] <0.19755.32>@riak_kv_vnode:init:265 Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: lock /var/data/ctv/riak/leveldb/1258832464966656615093408225054454710289582522368/LOCK: Resource temporarily unavailable"}
>>>>>> 2012-11-07 01:57:27.425 [info] <0.73.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.19181.32> [{initial_call,{riak_core_coverage_fsm,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] [{timeout,109},{old_heap_block_size,0},{heap_block_size,28657},{mbuf_size,0},{stack_size,47},{old_heap_size,0},{heap_size,9171}]
>>>>>> 2012-11-07 01:57:51.109 [error] <0.10002.0> CRASH REPORT Process <0.10002.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.10005.0>,{work,<0.10003.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{40916762,{1398702738851840683437120250060505233655091691520,'[email protected]'}},<0.18429.32>}}]}} in gen_fsm:terminate/7 line
>>>>>>
>>>>>> On Nov 7, 2012, at 11:34 AM, David Lowell wrote:
>>>>>>
>>>>>>> Hello Riak Folks,
>>>>>>>
>>>>>>> The last three days, we've been having a string of problems with
>>>>>>> Riak. An otherwise healthy server running our full application stack
>>>>>>> will suddenly start throwing a bunch of errors in the logs. Although
>>>>>>> the Riak processes stay up, most or all requests to Riak fail during
>>>>>>> these periods.
>>>>>>>
>>>>>>> The errors in the logs predominantly describe "riak_kv_vnode worker
>>>>>>> pool crashed" and timeout conditions. This morning, this crashy
>>>>>>> behavior started immediately after a clean Riak startup and a single
>>>>>>> call to our API, so the logs are quite free of other noise. I've
>>>>>>> summarized those logs below for curious parties, and can attach the
>>>>>>> full set of logs if needed.
>>>>>>>
>>>>>>> I forgot to check this morning, but during a similar outage on
>>>>>>> Monday, the Riak server was refusing connections to new clients.
>>>>>>>
>>>>>>> Interestingly, after giving Riak a while with no traffic at all today
>>>>>>> (like 15-30 minutes), it appears to have recovered without a restart.
>>>>>>> We've had similar recoveries during other "outages" of this type
>>>>>>> since Sunday evening. Facilitating this sort of recovery does seem to
>>>>>>> require shutting down all application KV requests for a while.
>>>>>>>
>>>>>>> We're suspicious of some kind of corruption in the eleveldb on-disk
>>>>>>> files, because in past outages of this type, we've observed that the
>>>>>>> condition persists over reboots. But we don't have much more evidence
>>>>>>> than that. Is there a command we can run that will check over
>>>>>>> eleveldb files for corruption or inconsistency?
>>>>>>>
>>>>>>> Other than that, what can cause "worker pool crashed" events? What do
>>>>>>> you know about the "timeouts" that are in these logs?
>>>>>>>
>>>>>>> For the record, we're running Riak 1.2.0 on Ubuntu 10.04, eleveldb
>>>>>>> backend with 512 partitions. We're running predominantly in a
>>>>>>> single-node configuration on a bunch of isolated dev boxes at the
>>>>>>> moment, on our way to spreading out our 512 vnodes onto 5 hosts in
>>>>>>> production.
>>>>>>>
>>>>>>> Thanks for your help,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> --
>>>>>>> Dave Lowell
>>>>>>> [email protected]
>>>>>>>
>>>>>>> 2012-11-07 18:11:03.398 [info] <0.7.0> Application lager started on node '[email protected]'
>>>>>>>
>>>>>>> ... normal startup messages ...
>>>>>>>
>>>>>>> 2012-11-07 18:11:50.109 [info] <0.10582.0>@riak_core:wait_for_application:419 Wait complete for application riak_search (0 seconds)
>>>>>>> 2012-11-07 18:22:18.509 [error] <0.2897.0>@riak_core_vnode:handle_info:510 105616329260241031198313161739262640092323250176 riak_kv_vnode worker pool crashed {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>>>>>> 2012-11-07 18:22:18.509 [error] <0.2899.0> gen_fsm <0.2899.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}}
>>>>>>>
>>>>>>> ... 13 more "riak_kv_vnode worker pool crashed" messages ...
>>>>>>>
>>>>>>> 2012-11-07 18:22:21.245 [error] <0.2899.0> CRASH REPORT Process <0.2899.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.2902.0>,{work,<0.2900.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{105616329260241031198313161739262640092323250176,'[email protected]'}},<0.15324.0>}}]}} in gen_fsm:terminate/7 line 611
>>>>>>> 2012-11-07 18:22:21.844 [error] <0.2944.0> gen_fsm <0.2944.0> in state ready terminated with reason: {timeout,{gen_server,call,[<0.2947.0>,{work,<0.2945.0>,{fold,#Fun<riak_kv_eleveldb_backend.3.97398576>,#Fun<riak_kv_vnode.14.47983300>},{fsm,{96247562,{114179815416476790484662877555959610910619729920,'[email protected]'}},<0.15324.0>}}]}}
13 more "CRASH REPORT Process <X> with 0 neighbours exited with >>>>>>> reason" and "gen_fsm <0.2989.0> in state ready terminated with reason" >>>>>>> message pairs >>>>>>> >>>>>>> 2012-11-07 18:23:21.427 [error] <0.15322.0> gen_server <0.15322.0> >>>>>>> terminated with reason: >>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>>> 2012-11-07 18:23:21.495 [error] <0.15322.0> CRASH REPORT Process >>>>>>> <0.15322.0> with 0 neighbours exited with reason: >>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>>> in gen_server:terminate/6 line 747 >>>>>>> 2012-11-07 18:23:21.525 [error] <0.10590.0> Supervisor riak_api_pb_sup >>>>>>> had child undefined started with >>>>>>> {riak_api_pb_server,start_link,undefined} at <0.15322.0> exit with >>>>>>> reason >>>>>>> {error,{case_clause,{error,timeout,[[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[<<"1352256943.4983411">>],[],[],[],...]}},...} >>>>>>> in context child_terminated >>>>>> >>>>>> _______________________________________________ >>>>>> riak-users mailing list >>>>>> [email protected] >>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >>>>> >>>> >>> >> >
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
