An update on yesterday's investigation and resolution.

The log messages quoted below were similar to those others have reported
following corruption of riak's merge index files. This thread from August was
particularly helpful:

http://comments.gmane.org/gmane.comp.db.riak.user/8442

By disabling search on this host, I was able to get riak to start without 
crashing. This seemed to confirm the theory that corrupted merge index files 
were behind the instability.

Next I tried to run the recommended steps to repair the search indexes:

http://docs.basho.com/riak/latest/cookbooks/Repairing-Search-Indexes/

These efforts were unsuccessful. Riak needs to be running to apply these fixes, 
and I could only get riak to stay up with search disabled. Running the commands 
in this how-to didn't seem to do anything at first, presumably because search 
was disabled.

So I went back to the first thread above, and to the external script suggested 
by Ryan Zezeski for detecting corruption in merge index files. Running that script 
on all 512 merge index partitions on this host showed that 13 of them were 
corrupted. I moved those 13 directories elsewhere, and started up riak with 
search enabled. It did not crash this time.
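
For anyone repeating the "move the corrupted partitions aside" step, here is a 
minimal sketch in Python. It is not the detection script from the thread above; 
it assumes you already have the list of partition ids that script flagged. The 
function name, the directory layout (one subdirectory per partition under a 
merge index root), and the quarantine location are all assumptions made for 
illustration, so adjust them to match your own data directory.

```python
import shutil
from pathlib import Path


def quarantine_partitions(index_root, quarantine_root, corrupted_ids):
    """Move the named partition directories out of index_root.

    index_root      -- directory holding one subdirectory per partition
                       (assumed layout, not taken from the riak docs)
    quarantine_root -- where the flagged directories are moved to
    corrupted_ids   -- partition ids reported by a separate corruption check

    Returns the list of partition ids that were actually moved.
    """
    index_root = Path(index_root)
    quarantine_root = Path(quarantine_root)
    quarantine_root.mkdir(parents=True, exist_ok=True)

    moved = []
    for partition_id in corrupted_ids:
        source = index_root / str(partition_id)
        if not source.is_dir():
            # Skip ids with no matching directory rather than guessing.
            continue
        # Move instead of delete, so the data can be inspected or restored.
        shutil.move(str(source), str(quarantine_root / source.name))
        moved.append(str(partition_id))
    return moved
```

This would be run with riak stopped, passing the merge index directory, a 
scratch directory on the same host, and the flagged ids. Moving rather than 
deleting keeps the corrupted files around for later inspection.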

I again ran the above steps to repair the search index, in the hope of 
rebuilding from replicas the index data that I had moved aside. With riak 
running with search enabled, the steps seemed to do something this time: there 
were records in the logs reporting repair activity. Unfortunately, riak 
processes started crashing during the repair. So I restarted riak to put it out 
of its misery, and it came up cleanly. I've decided to live with the loss of 
index data.

So that's where yesterday's issues ended up. This of course leaves open the 
question of how these index files got corrupted in the first place. My theory 
is that the corruption results from the riak weirdness I've reported in other 
threads: "Unrecognized message", various timeouts, and "riak_kv_vnode worker 
pool crashed" errors.

Dave

--
Dave Lowell
[email protected]

On Nov 19, 2012, at 9:49 AM, David Lowell wrote:

> Riak is crashing on startup. This is on the same host on which I've reported 
> other riak weirdness. I've included representative logs below. Several 
> hundred similar log messages follow these ones, followed by messages about 
> riak shutting down.
> 
> This process is configured with a file descriptor limit of 100,000. We're 
> running riak 1.2.1. The host has 32 GB of physical memory. 512 vnodes. 
> 
> Any ideas?
> 
> Dave
> 
> --
> Dave Lowell
> [email protected]
> 
> 2012-11-19 17:27:07.309 [info] <0.7.0> Application riak_control started on 
> node '[email protected]'
> 2012-11-19 17:27:07.310 [info] <0.7.0> Application erlydtl started on node 
> '[email protected]'
> 2012-11-19 17:27:07.340 [info] <0.10567.0>@riak_core:wait_for_application:419 
> Wait complete for application riak_search (0 seconds)
> 2012-11-19 17:27:15.645 [error] <0.10773.0> CRASH REPORT Process <0.10773.0> 
> with 0 neighbours exited with reason: bad argument in call to 
> erlang:binary_to_term(<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,...>>)
>  in mi_buffer:read_value/2 line 162 in gen_server:init_it/6 line 328
> 2012-11-19 17:27:15.729 [error] <0.10772.0> CRASH REPORT Process <0.10772.0> 
> with 0 neighbours exited with reason: no match of right hand value 
> {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,10,57,0,0,0,111,131,...>>],...},...]}}
>  in merge_index_backend:start/2 line 47 in gen_fsm:init_it/6 line 379
> 2012-11-19 17:27:15.784 [error] <0.138.0> Supervisor riak_core_vnode_sup had 
> child undefined started with {riak_core_vnode,start_link,undefined} at 
> <0.10772.0> exit with reason no match of right hand value 
> {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,10,57,0,0,0,111,131,...>>],...},...]}}
>  in merge_index_backend:start/2 line 47 in context child_terminated
> 2012-11-19 17:27:15.843 [error] <0.155.0> gen_server riak_core_vnode_manager 
> terminated with reason: no match of right hand value 
> {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,...>>],...},...

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
