An update on yesterday's investigation and resolution.

These log messages were similar to those others have reported following corruption of riak's merge index files. This thread from August was particularly helpful:

http://comments.gmane.org/gmane.comp.db.riak.user/8442

By disabling search on this host, I was able to get riak to start without crashing. This seemed to confirm the theory that corrupted merge index files were behind the instability.

Next I tried to run the recommended steps to repair the search indexes:

http://docs.basho.com/riak/latest/cookbooks/Repairing-Search-Indexes/

These efforts were unsuccessful. Riak needs to be running to apply these fixes, and I could only get riak to stay up with search disabled. Running the commands in this how-to didn't seem to do anything initially, I assume because search was disabled.

So I went back to the first thread above, and to the external script suggested by Ryan Zezeski for detecting corruption in merge index files. Running that script on all 512 merge index partitions on this host showed that 13 of them were corrupted. I moved those 13 directories elsewhere and started up riak with search enabled. It did not crash this time.

I again ran the above steps to repair the search index, in the hope of restoring from replicas the index data I had moved aside. With riak running with search enabled, the steps seemed to do something this time: there were records in the logs reporting repair activity. Unfortunately, riak processes started crashing during the repair. So I restarted riak to put it out of its misery, and it came up cleanly. I've decided to live with the loss of index data.

So that's where yesterday's issues ended up. This of course leaves open the question of how these index files got corrupted in the first place. I'm theorizing that the corruption results from the riak weirdness I've reported in other threads, with "Unrecognized message", various timeouts, and "riak_kv_vnode worker pool crashed" errors.

Dave

--
Dave Lowell
[email protected]

On Nov 19, 2012, at 9:49 AM, David Lowell wrote:

> Riak is crashing on startup. This is on the same host on which I've reported
> other riak weirdness.
> I've included representative logs below. Several
> hundred similar log messages follow these ones, followed by messages about
> riak shutting down.
>
> This process is configured with a file descriptor limit of 100,000. We're
> running riak 1.2.1. The host has 32 GB of physical memory. 512 vnodes.
>
> Any ideas?
>
> Dave
>
> --
> Dave Lowell
> [email protected]
>
> 2012-11-19 17:27:07.309 [info] <0.7.0> Application riak_control started on
> node '[email protected]'
> 2012-11-19 17:27:07.310 [info] <0.7.0> Application erlydtl started on node
> '[email protected]'
> 2012-11-19 17:27:07.340 [info] <0.10567.0>@riak_core:wait_for_application:419
> Wait complete for application riak_search (0 seconds)
> 2012-11-19 17:27:15.645 [error] <0.10773.0> CRASH REPORT Process <0.10773.0>
> with 0 neighbours exited with reason: bad argument in call to
> erlang:binary_to_term(<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,...>>)
> in mi_buffer:read_value/2 line 162 in gen_server:init_it/6 line 328
> 2012-11-19 17:27:15.729 [error] <0.10772.0> CRASH REPORT Process <0.10772.0>
> with 0 neighbours exited with reason: no match of right hand value
> {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,10,57,0,0,0,111,131,...>>],...},...]}}
> in merge_index_backend:start/2 line 47 in gen_fsm:init_it/6 line 379
> 2012-11-19 17:27:15.784 [error] <0.138.0> Supervisor riak_core_vnode_sup had
> child undefined started with {riak_core_vnode,start_link,undefined} at
> <0.10772.0> exit with reason no match of right hand value
> {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,10,57,0,0,0,111,131,...>>],...},...]}}
> in merge_index_backend:start/2 line 47 in context child_terminated
> 2012-11-19 17:27:15.843 [error] <0.155.0> gen_server riak_core_vnode_manager
> terminated with reason: no match of right hand value
> {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,1,104,4,104,3,109,0,0,0,14,99,116,118,95,105,109,103,95,115,101,97,114,99,104,109,0,0,0,6,102,111,114,109,97,116,109,0,0,0,4,106,112,101,103,109,0,0,0,34,47,105,109,97,103,101,47,118,50,47,114,101,112,111,47,106,118,98,49,114,53,77,69,117,115,50,102,119,81,46,106,112,101,103,110,7,1,144,25,142,192,207,206,4,108,0,0,0,3,104,2,100,0,1,112,107,0,1,0,104,2,109,0,0,0,7,101,120,112,105,114,101,115,108,0,0,0,1,109,0,0,0,...>>],...},...
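For anyone trying to read the binary that binary_to_term rejects in the crash reports above: it begins with byte 131, the version tag of Erlang's external term format, followed by list (108), small tuple (104) and binary (109) tags. The following is a minimal Python sketch, not a full decoder, that pulls the leading binary fields out of the prefix visible in the log. The helper name leading_binaries is made up for illustration, and the log truncates the term, so only the bytes shown there are used.

```python
import struct

# Prefix of the term from the crash report, copied from the log up to the
# end of the fourth binary field. The rest of the term is not decoded here.
PREFIX = bytes([
    131, 108, 0, 0, 0, 1, 104, 4, 104, 3,
    109, 0, 0, 0, 14, 99, 116, 118, 95, 105, 109, 103, 95, 115, 101, 97, 114, 99, 104,
    109, 0, 0, 0, 6, 102, 111, 114, 109, 97, 116,
    109, 0, 0, 0, 4, 106, 112, 101, 103,
    109, 0, 0, 0, 34, 47, 105, 109, 97, 103, 101, 47, 118, 50, 47, 114, 101,
    112, 111, 47, 106, 118, 98, 49, 114, 53, 77, 69, 117, 115, 50, 102, 119,
    81, 46, 106, 112, 101, 103,
])

LIST_EXT = 108         # followed by a 4-byte element count
SMALL_TUPLE_EXT = 104  # followed by a 1-byte arity
BINARY_EXT = 109       # followed by a 4-byte length, then the payload


def leading_binaries(data):
    """Return the binary payloads at the start of an external-format term.

    Hypothetical helper: it skips list and tuple headers, collects binaries,
    and stops at the first tag it does not know or at a truncated field.
    """
    if not data or data[0] != 131:
        raise ValueError("not an Erlang external term (missing version tag 131)")
    pos, found = 1, []
    while pos < len(data):
        tag = data[pos]
        if tag == LIST_EXT:
            pos += 5
        elif tag == SMALL_TUPLE_EXT:
            pos += 2
        elif tag == BINARY_EXT:
            if pos + 5 > len(data):
                break  # length field itself is cut off
            (length,) = struct.unpack(">I", data[pos + 1:pos + 5])
            end = pos + 5 + length
            if end > len(data):
                break  # payload is cut off
            found.append(data[pos + 5:end].decode("latin-1"))
            pos = end
        else:
            break
    return found
```

Calling leading_binaries(PREFIX) returns ['ctv_img_search', 'format', 'jpeg', '/image/v2/repo/jvb1r5MEus2fwQ.jpeg']. Reading these as index, field, term and document key is an interpretation, not something the logs state.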
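The triage step described in the update at the top (check each merge index partition, move the corrupted ones aside, then start riak) can be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used: is_corrupt is a stand-in for whatever check you trust (for example a wrapper around the external detection script from the linked thread), the function name and directory arguments are hypothetical, and it should only be run with riak stopped.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def quarantine_corrupt_partitions(
    index_root: Path,
    quarantine: Path,
    is_corrupt: Callable[[Path], bool],
) -> List[Path]:
    """Move every partition directory flagged by is_corrupt out of index_root.

    index_root: directory holding one subdirectory per partition (assumed layout).
    quarantine: destination outside index_root; created if missing.
    Returns the new locations of the moved directories.
    """
    quarantine.mkdir(parents=True, exist_ok=True)
    # Materialise the listing first so moving directories cannot disturb iteration.
    partitions = sorted(p for p in index_root.iterdir() if p.is_dir())
    moved = []
    for partition in partitions:
        if is_corrupt(partition):
            target = quarantine / partition.name
            shutil.move(str(partition), str(target))
            moved.append(target)
    return moved
```

Moving rather than deleting keeps the flagged directories available in case a later repair or inspection needs them.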
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
