Hey Armon,

So "monitor busy_dist_port" means a process got suspended while sending to a busy distribution port, i.e. your nodes are having trouble talking to each other, but we need to figure out why. Specifically, it looks like your kv vnodes aren't able to communicate.
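In the meantime, a rough way to see which ports and process types are hitting busy_dist_port is to tally the log entries. Below is a minimal Python sketch, assuming each entry sits on a single line in the log file (not wrapped as in the email) and matches the format in your excerpt; `tally_busy_dist_port` is just a name made up for illustration.

```python
import re
from collections import Counter

# Matches one busy_dist_port entry in the format shown in the log excerpt.
# Assumption: each entry is on a single line (unwrapped log file).
BUSY_RE = re.compile(
    r"monitor busy_dist_port\s+<[\d.]+>\s+"
    r"\[\{initial_call,\{(?P<mod>\w+),(?P<fun>\w+),(?P<arity>\d+)\}\}.*?\]\s+"
    r"\{(?P<port>#Port<[\d.]+>),'(?P<node>[^']*)'\}"
)


def tally_busy_dist_port(lines):
    """Count busy_dist_port events by (initial call, port, remote node)."""
    counts = Counter()
    for line in lines:
        m = BUSY_RE.search(line)
        if m:
            call = f"{m['mod']}:{m['fun']}/{m['arity']}"
            counts[(call, m["port"], m["node"])] += 1
    return counts
```

Passing it the lines of a node's log (for example `tally_busy_dist_port(open(path))`, where `path` is wherever your log lives) returns a Counter; its `most_common()` shows whether a single port, and so a single peer connection, accounts for most of the events.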
First questions:

* Which backend are you using?
* What OS?
* What size are your values?
* What is the typical traffic (ops/second) on the cluster?

Also, if you could send a copy of your vm.args (probably best off-list), that would be helpful, too.

Mark

On Tue, May 8, 2012 at 10:25 AM, Armon Dadgar <[email protected]> wrote:

> We are currently running a 4 node cluster with 1.1.2 on Ubuntu 10.04, and
> are experiencing an issue where losing a single node has caused the entire
> cluster to fail.
>
> Nagios reported that node 1 had failed. Shortly after, all the logs are
> filled with:
> 2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to '[email protected]' - nodedown
> 2012-05-08 08:21:11.890 [error] <0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition riak_kv_vnode 456719261665907161938651510223838443642478919680 from '[email protected]' to '[email protected]' failed exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'[email protected]'},handoff_port,infinity]}}
> 2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to '[email protected]' - timeout
> 2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to '[email protected]' - timeout
> ...
> 2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4921.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'[email protected]'}
> 2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,7}
> 2012-05-08 08:30:26.616 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4930.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'[email protected]'}
> 2012-05-08 08:30:27.565 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,4}
> 2012-05-08 08:30:27.668 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.3151.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'[email protected]'}
> ...
> 2012-05-08 10:20:30.088 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31200.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:31.261 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31202.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:32.736 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.1629.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:33.552 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,3}
>
> Now all the logs are basically being completely filled with
> "monitor busy_dist_port <0.22647.2610> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35563927>,'[email protected]'}"
> or similar.
>
> Riak-admin is unable to report any information about the cluster, and same
> with Riak Control.
> Both just time out and return:
>
> production-vpc east-riak-002 riak $ riak-admin ring_status
> Attempting to restart script through sudo -u riak
> RPC to '[email protected]' failed: {'EXIT',
> {timeout,
> {gen_server,call,
> [riak_core_gossip,
> legacy_gossip]}}}
>
> At this point, the cluster has stopped responding to any requests as far
> as I can tell,
> or any operations that do complete take well over 60 seconds for a single
> put with w=1.
>
> Wondering if anybody else has seen this, and if so any advice for getting
> it resolved?
>
> Best Regards,
>
> Armon Dadgar
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
