Yuri,

Bitcask merging is normal, but combined with incoming handoff, it may be overloading the node. Two things you might try:

- Reduce handoff_concurrency to 1 on all nodes to reduce the impact of handoff (http://docs.basho.com/riak/latest/references/Configuration-Files/)
- Restrict when Bitcask is allowed to merge by setting the merge_window on the node that is being overloaded (http://docs.basho.com/riak/latest/tutorials/choosing-a-backend/Bitcask/#Configuring-Bitcask)
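As a rough sketch, the two settings above might look like this in app.config. The merge window hours (1 and 5) are placeholders chosen for illustration, not a recommendation; pick whatever quiet period suits the affected node. The riak_core and bitcask sections will most likely already exist in your file, so add the settings to the existing sections rather than duplicating them.

```erlang
%% app.config (excerpt): illustrative sketch only, not a complete file.
[
 {riak_core, [
   %% Allow at most one concurrent handoff transfer per node.
   {handoff_concurrency, 1}
 ]},
 {bitcask, [
   %% Allow merges only during local hours 1 through 5 (24-hour clock).
   %% These hours are assumed placeholder values.
   {merge_window, {1, 5}}
 ]}
].
```

Since app.config is read when the node starts, a restart of each changed node would probably be needed for the new values to take effect.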
Joe Caswell

From: Yuri Lukyanov <[email protected]>
Date: Thursday, April 18, 2013 5:07 AM
To: "[email protected]" <[email protected]>
Subject: Simultaneous handoff and merge

Hi,

I have a cluster of 17 Riak (1.2.1) nodes with Bitcask as a backend. Recently one of the nodes was down for a while. After the node had been started, the cluster started doing handoffs as expected. But then a merge process also began on the same node. I know this from log messages like this:

2013-04-18 08:14:09.061 [info] <0.22952.79> Merged ["/var/lib/riak/bitcask/496682197061674038608283517368424307461195825152"

And then something went wrong (the logs on the same node):

2013-04-18 08:39:22.217 [error] <0.31842.70> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.4000.80> exit with reason {timeout,{gen_server,call,[riak_core_handoff_manager,{add_outbound,riak_kv_vnode,208378163135070142634509751539626289911881007104,riak@nsto2r5,<0.4000.80>}]}} in context child_terminated

2013-04-18 08:42:46.067 [error] <0.5154.80> gen_server <0.5154.80> terminated with reason: {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}}

2013-04-18 08:42:52.790 [error] <0.5154.80> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in gen_server:terminate/6 line 747

2013-04-18 08:42:53.450 [error] <0.31847.70> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at <0.5154.80> exit with reason {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in context child_terminated

The node itself was disappearing from time to time:

# riak-admin ring-status
Node is not running!

The beam process was still running though.

Maybe it's not related to handoffs and merges. It was just a guess.

Any information and advice on this would be greatly appreciated. It's still happening right now, and I could gather more details if someone wanted me to.

Thanks in advance.

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
