Justin,

The terms being stored in merge_index are too large. The maximum size for an
{Index, Field, Term} key is 32KB. The binary blob in your log entry represents
a tuple that was 32952 bytes. Since merge_index uses a 15-bit integer to store
the term size, if the term_to_binary of the given key is larger than 32767
bytes, the high bits are lost, effectively storing (<large size> mod 32768)
bytes.
When this data is read back, binary_to_term is unable to reconstruct the key
because of the missing bytes, and throws a badarg exception.
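
To make that concrete, here is a rough illustration you could run in an Erlang
shell (the index/field names and the oversized value below are placeholders,
not data from your cluster):

TermValue = binary:copy(<<"x">>, 33000).
KeyBin = term_to_binary({<<"my_index">>, <<"my_field">>, TermValue}).
byte_size(KeyBin).               %% > 32767, like the 32952-byte key in your log
byte_size(KeyBin) band 16#7FFF.  %% what a 15-bit size field actually records
32952 band 16#7FFF.              %% = 184 bytes recorded for your specific key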

Search index repair is documented here:
http://docs.basho.com/riak/1.4.0/cookbooks/Repairing-Search-Indexes/
However, you would first need to modify your extractor so it does not produce
search keys larger than 32KB, or the corruption will recur.
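
As a rough sketch only (the function and variable names below are hypothetical,
not taken from your extractor), the idea is to drop any term whose encoded
{Index, Field, Term} key would exceed the limit before it reaches merge_index:

%% Pairs is assumed to be the [{Field, Term}] list your extractor produces
%% for a single document in index Index.
-define(MAX_KEY_SIZE, 32767).

filter_oversized(Index, Pairs) ->
    [{Field, Term} || {Field, Term} <- Pairs,
     byte_size(term_to_binary({Index, Field, Term})) =< ?MAX_KEY_SIZE].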

Joe Caswell


From:  Richard Shaw <[email protected]>
Date:  Sunday, November 24, 2013 4:25 PM
To:  Justin Long <[email protected]>
Cc:  riak-users <[email protected]>
Subject:  Re: Runaway "Failed to compact" errors

OK thanks Justin, please can you update vm.args on each node with the
following and then restart each node:

-env ERL_MAX_ETS_TABLES 256000

I'd also like you to please confirm the ulimit on each server

$ riak attach
os:cmd("ulimit -n").
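
While you're attached, you could also sanity-check the ETS settings the running
VM was actually started with (optional; just standard Erlang calls):

os:getenv("ERL_MAX_ETS_TABLES").  %% limit the VM was started with, or false
length(ets:all()).                %% ETS tables currently in use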

If you're running Riak >= 1.4, exit the attach session with Ctrl+c then a; if
you're running 1.3 or older, exit with Ctrl+d.

I would recommend upping the ulimit to 65536 if it's not already there [0]

[0]http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/

I'm going to need to sign off at this point, Justin; I'll see if a colleague
can take over.


Kind regards,

Richard



On 24 November 2013 21:00, Justin Long <[email protected]> wrote:
> Hi Richard,
> 
> Result turned up empty on the failed node. Here's what is in vm.args:
> 
> ------------------------------------------------------
> 
> # Name of the riak node
> -name [email protected]
> 
> ## Cookie for distributed erlang.  All nodes in the same cluster
> ## should use the same cookie or they will not be able to communicate.
> -setcookie riak
> 
> ## Heartbeat management; auto-restarts VM if it dies or becomes unresponsive
> ## (Disabled by default..use with caution!)
> ##-heart
> 
> ## Enable kernel poll and a few async threads
> +K true
> +A 64
> 
> ## Treat error_logger warnings as warnings
> +W w
> 
> ## Increase number of concurrent ports/sockets
> -env ERL_MAX_PORTS 4096
> 
> ## Tweak GC to run more often
> -env ERL_FULLSWEEP_AFTER 0
> 
> ## Set the location of crash dumps
> -env ERL_CRASH_DUMP /var/log/riak/erl_crash.dump
> 
> ## Raise the ETS table limit
> -env ERL_MAX_ETS_TABLES 22000
> 
> ------------------------------------------------------
> 
> 
> Before I received your email, I have since isolated the node and force-removed
> it from the cluster. In the meantime, I brought up a new fresh node and joined
> it to the cluster. When Riak went to handoff some of the RiakSearch indexes
> here is what was popping up in console.log:
> 
> ------------------------------------------------------
> 
> <0.4262.0>@merge_index_backend:async_fold_fun:116 failed to iterate the index
> with reason 
> {badarg,[{erlang,binary_to_term,[<<131,104,3,109,0,0,0,25,99,111,108,108,101,9
> 9,116,111,114,45,99,111,108,108,101,99,116,45,116,119,105,116,116,101,114,109,
> 0,0,0,14,100,97,116,97,95,102,111,108,108,111,119,101,114,115,109,0,0,128,199,
> 123,34,105,100,115,34,58,91,49,52,52,55,51,54,54,57,53,48,44,53,48,48,55,53,57
> ,48,55,44,52,51,56,49,55,53,52,56,53,44,49,51,54,53,49,50,49,52,50,44,52,54,50
> ,52,52,54,56,51,44,49,48,55,57,56,55,49,50,48,48,44,55,55,48,56,51,54,55,57,44
> ,50,56,51,56,51,57,55,56,44,49,57,50,48,55,50,55,51,48,44,51,57,54,57,56,56,57
> ,56,55,44,50,56,48,50,54,51,56,48,52,44,53,57,50,56,56,53,50,51,48,44,49,50,52
> ,55,53,56,57,53,55,56,44,49,55,51,56,56,51,53,52,50,44,49,53,56,57,54,51,50,50
> ,50,48,44,53,53,49,51,57,57,51,48,49,44,50,50,48,53,52,55,52,55,55,54,44,49,51
> ,51,52,57,56,57,56,50,53,44,51,49,50,51,57,53,55,54,50>>],[]},{mi_segment,iter
> ate_all_bytes,2,[{file,"src/mi_segment.erl"},{line,167}]},{mi_server,'-group_i
> terator/2-fun-1-',2,[{file,"src/mi_server.erl"},{line,725}]},{mi_server,'-grou
> p_iterator/2-fun-0-',2,[{file,"src/mi_server.erl"},{line,722}]},{mi_server,ite
> rate2,5,[{file,"src/mi_server.erl"},{line,693}]}]} and partial acc
> {{ho_acc,226,ok,#Fun<riak_core_handoff_sender.3.77335415>,riak_search_vnode,<0
> .4042.0>,#Port<0.754041>,{274031556999544297163190906134303066185487351808,274
> 031556999544297163190906134303066185487351808},{ho_stats,{1385,326398,498434},
> undefined,14225,2426123},gen_tcp,50226},{{<<"collector-collect-instagram-cache
> ">>,{<<"data_follows">>,<<"who">>}},[{<<"3700758">>,[{p,[561]}],13837594294135
> 36},{<<"368835984">>,[{p,[297,303]}],1383611963556763},{<<"368835984">>,[{p,[2
> 98,304]}],1383756713657753},{<<"31715058">>,[{p,[325]}],1383611996352193}]},4}
> 2013-11-24 20:53:17.468 [error]
> <0.16054.14>@riak_core_handoff_sender:start_fold:215 ownership_handoff
> transfer of riak_search_vnode from '[email protected]'
> 274031556999544297163190906134303066185487351808 to '[email protected]'
> 274031556999544297163190906134303066185487351808 failed because of
> error:{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,104,3,109,0,0,0,
> 25,99,111,108,108,101,99,116,111,114,45,99,111,108,108,101,99,116,45,116,119,1
> 05,116,116,101,114,109,0,0,0,14,100,97,116,97,95,102,111,108,108,111,119,101,1
> 14,115,109,0,0,128,199,123,34,105,100,115,34,58,91,49,52,52,55,51,54,54,57,53,
> 48,44,53,48,48,55,53,57,48,55,44,52,51,56,49,55,53,52,56,53,44,49,51,54,53,49,
> 50,49,52,50,44,52,54,50,52,52,54,56,51,44,49,48,55,57,56,55,49,50,48,48,44,55,
> 55,48,56,51,54,55,57,44,50,56,51,56,51,57,55,56,44,49,57,50,48,55,50,55,51,48,
> 44,51,57,54,57,56,56,57,56,55,44,50,56,48,50,54,51,56,48,52,44,53,57,50,56,56,
> 53,50,51,48,44,49,50,52,55,53,56,57,53,55,56,44,49,55,51,56,56,51,53,52,50,44,
> 49,53,56,57,54,51,50,50,50,48,44,53,53,49,51,57,57,51,48,49,44,50,50,48,53,52,
> 55,52,55,55,54,44,49,51,51,52,57,56,57,56,50,53,44,51,49,50,51,57,53,55,54,50>
> >],[]},{mi_segment,iterate_all_bytes,2,[{file,"src/mi_segment.erl"},{line,167}
> ]},{mi_server,'-group_iterator/2-fun-1-',2,[{file,"src/mi_server.erl"},{line,7
> 25}]},{mi_server,'-group_iterator/2-fun-0-',2,[{file,"src/mi_server.erl"},{lin
> e,722}]},{mi_server,iterate2,5,[{file,"src/mi_server.erl"},{line,693}]}]},{{ho
> _acc,226,ok,#Fun<riak_core_handoff_sender.3.77335415>,riak_search_vnode,<0.404
> 2.0>,#Port<0.754041>,{274031556999544297163190906134303066185487351808,2740315
> 56999544297163190906134303066185487351808},{ho_stats,{1385,326398,498434},unde
> fined,14225,2426123},gen_tcp,50226},{{<<"collector-collect-instagram-cache">>,
> {<<"data_follows">>,<<"who">>}},[{<<"3700758">>,[{p,[561]}],1383759429413536},
> {<<"368835984">>,[{p,[297,303]}],1383611963556763},{<<"368835984">>,[{p,[298,3
> 04]}],1383756713657753},{<<"31715058">>,[{p,[325]}],1383611996352193}]},4}}}
> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.e
> rl"},{line,161}]}]
> 
> ------------------------------------------------------
> 
> I am aware that values in that bucket might be larger than most of our other
> objects. Not sure if that would cause the issues, though. Thanks for your
> help!
> 
> J
> 
> 
> 
> On Nov 24, 2013, at 12:51 PM, Richard Shaw <[email protected]> wrote:
> 
>> Hi Justin,
>> 
>> Please can you run this command to look for compaction errors in the leveldb
>> logs on the node with the crash log entries
>> 
>> grep -R "Compaction error" /var/lib/riak/leveldb/*/LOG
>> 
>> Where the path matches your path to the leveldb dir
>> 
>> Thanks
>> 
>> Richard
>> 
>> 
>> 
>> On 24 November 2013 10:45, Justin Long <[email protected]> wrote:
>>> Hello everyone,
>>> 
>>> Our Riak cluster has failed after what seems to be an issue in LevelDB. We
>>> noticed that a process running segment compaction has started to throw errors
>>> non-stop. I opened a Stack Overflow question where you will find a lot of log
>>> data:
>>> http://stackoverflow.com/questions/20172878/riak-is-throwing-failed-to-compact-like-crazy
>>> 
>>> Here is exactly what we're getting in console.log:
>>> 
>>> 2013-11-24 10:38:46.803 [info]
>>> <0.19760.0>@riak_core_handoff_receiver:process_message:99 Receiving handoff
>>> data for partition
>>> riak_search_vnode:1050454301831586472458898473514828420377701515264
>>> 2013-11-24 10:38:47.239 [info]
>>> <0.19760.0>@riak_core_handoff_receiver:handle_info:69 Handoff receiver for
>>> partition 1050454301831586472458898473514828420377701515264 exited after
>>> processing 5409 objects
>>> 2013-11-24 10:38:49.743 [error] emulator Error in process <0.19767.0> on
>>> node '[email protected]' with exit value:
>>> {badarg,[{erlang,binary_to_term,[<<260
>>> bytes>>],[]},{mi_segment,iterate_all_bytes,2,[{file,"src/mi_segment.erl"},{l
>>> ine,167}]},{mi_server,'-group_iterator/2-fun-0-',2,[{file,"src/mi_server.erl
>>> "},{line,722}]},{mi_server,'-group_iterator/2-fun-1-'...
>>> 
>>> 
>>> 2013-11-24 10:38:49.743 [error] <0.580.0>@mi_scheduler:worker_loop:141
>>> Failed to compact <0.11868.0>:
>>> {badarg,[{erlang,binary_to_term,[<<131,104,3,109,0,0,0,25,99,111,108,108,101
>>> ,99,116,111,114,45,99,111,108,108,101,99,116,45,116,119,105,116,116,101,114,
>>> 109,0,0,0,14,100,97,116,97,95,102,111,108,108,111,119,101,114,115,109,0,0,12
>>> 8,203,123,34,105,100,115,34,58,91,49,54,50,51,53,50,50,50,50,51,44,49,55,51,
>>> 55,51,52,52,50,44,49,50,56,51,52,52,56,55,51,57,44,51,57,56,56,57,56,50,51,5
>>> 2,44,49,52,52,55,51,54,54,57,53,48,44,53,48,48,55,53,57,48,55,44,52,51,56,49
>>> ,55,53,52,56,53,44,49,51,54,53,49,50,49,52,50,44,52,54,50,52,52,54,56,51,44,
>>> 49,48,55,57,56,55,49,50,48,48,44,55,55,48,56,51,54,55,57,44,50,56,51,56,51,5
>>> 7,55,56,44,49,57,50,48,55,50,55,51,48,44,51,57,54,57,56,56,57,56,55,44,50,56
>>> ,48,50,54,51,56,48,52,44,53,57,50,56,56,53,50,51,48,44,49,50,52,55,53,56,57,
>>> 53,55,56,44,49,55,51,56,56,51,53,52,50,44,49,53,56,57,54,51,50,50,50,48,44,5
>>> 3,53,49,51>>],[]},{mi_segment,iterate_all_bytes,2,[{file,"src/mi_segment.erl
>>> "},{line,167}]},{mi_server,'-group_iterator/2-fun-0-',2,[{file,"src/mi_serve
>>> r.erl"},{line,722}]},{mi_server,'-group_iterator/2-fun-1-',2,[{file,"src/mi_
>>> server.erl"},{line,725}]},{mi_server,'-group_iterator/2-fun-0-',2,[{file,"sr
>>> c/mi_server.erl"},{line,722}]},{mi_server,'-group_iterator/2-fun-1-',2,[{fil
>>> e,"src/mi_server.erl"},{line,725}]},{mi_server,'-group_iterator/2-fun-0-',2,
>>> [{file,"src/mi_server.erl"},{line,722}]},{mi_segment_writer,from_iterator,4,
>>> [{file,"src/mi_segment_writer.erl"},{line,110}]}]}
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> The log is just full of them. Thanks for your help! We need to get this
>>> cluster back up ASAP, appreciated!
>>> 
>>> - Justin
>>> 
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>> 
>> 
> 
