Re: Riak cluster unresponsive after single node failure

Mark Phillips Tue, 08 May 2012 11:12:40 -0700

Hey Armon,

So "monitor busy_dist_port" means your nodes aren't talking, but we need to
figure out why. Specifically, it looks like your kv vnodes aren't able to
communicate.
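
To see which peers are involved, it can help to tally the busy_dist_port
events per remote node. Below is a rough Python sketch of that idea, not an
official tool. It assumes each event sits on a single line in the raw log
file (unlike the wrapped copy in this email) and that the line ends with a
{#Port<...>,'node'} tuple as in the excerpts quoted below. The function name
is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the trailing tuple of a busy_dist_port event,
# e.g. {#Port<0.35446433>,'name@host'}, and captures the node name.
PORT_NODE = re.compile(r"\{#Port<[\d.]+>,'([^']*)'\}")


def count_busy_dist_port(lines):
    """Tally busy_dist_port events per remote node name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if "busy_dist_port" not in line:
            continue  # ignore unrelated log entries
        match = PORT_NODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_busy.py <path to a log file>
    with open(sys.argv[1]) as log:
        for node, n in count_busy_dist_port(log).most_common():
            print(node, n)
```

If nearly all events point at one node, that narrows the search to the link
to that node; an even spread suggests something cluster-wide.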


First, some questions:

* Which backend are you using?
* What OS?
* What size are your values?
* What is the typical traffic (ops/second) on the cluster?

Also, if you could send a copy of your vm.args (probably best off-list)
that would be helpful, too.

Mark

On Tue, May 8, 2012 at 10:25 AM, Armon Dadgar <[email protected]> wrote:

> We are currently running a 4 node cluster with 1.1.2 on Ubuntu 10.04, and
> are experiencing an issue where losing a single node has caused the entire
> cluster to fail.
>
> Nagios reported that node 1 had failed; shortly after, all the logs were
> filled with:
> 2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199
> Unable to forward put for
> {<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to '
> [email protected]' - nodedown
> 2012-05-08 08:21:11.890 [error]
> <0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition
> riak_kv_vnode 456719261665907161938651510223838443642478919680 from '
> [email protected]' to '[email protected]'
> failed exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'
> [email protected]'},handoff_port,infinity]}}
> 2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199
> Unable to forward put for
> {<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to '
> [email protected]' - timeout
> 2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199
> Unable to forward put for
> {<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to '
> [email protected]' - timeout
> ...
> 2012-05-08 08:30:26.379 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.4921.2570>
> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35446433>,'[email protected]'}
> 2012-05-08 08:30:26.556 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
> {suppressed,port_events,7}
> 2012-05-08 08:30:26.616 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.4930.2570>
> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35446433>,'[email protected]'}
> 2012-05-08 08:30:27.565 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
> {suppressed,port_events,4}
> 2012-05-08 08:30:27.668 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.3151.2570>
> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35446433>,'[email protected]'}
> ...
> 2012-05-08 10:20:30.088 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.31200.2576>
> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:31.261 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.31202.2576>
> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:32.736 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
> <0.1629.2570>
> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35534018>,'[email protected]'}
> 2012-05-08 10:20:33.552 [info]
> <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
> {suppressed,port_events,3}
>
> Now all the logs are basically being completely filled with "monitor
> busy_dist_port <0.22647.2610>
> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> {#Port<0.35563927>,'[email protected]'}" or similar.
>
> riak-admin is unable to report any information about the cluster, and the
> same goes for Riak Control.
> Both just time out and return:
>
> production-vpc east-riak-002 riak $ riak-admin ring_status
> Attempting to restart script through sudo -u riak
> RPC to '[email protected]' failed: {'EXIT',
>                                                      {timeout,
>                                                       {gen_server,call,
>                                                        [riak_core_gossip,
>                                                         legacy_gossip]}}}
>
> At this point, the cluster has stopped responding to any requests as far
> as I can tell,
> or any operations that do complete take well over 60 seconds for a single
> put with w=1.
>
> Wondering if anybody else has seen this, and if so, any advice for getting
> it resolved?
>
> Best Regards,
>
> Armon Dadgar
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>

