We are currently running a four-node Riak 1.1.2 cluster on Ubuntu 10.04, and
are experiencing an issue where losing a single node has caused the entire
cluster to fail.

Nagios reported that node 1 had failed. Shortly afterwards, all of the logs
began filling with:
2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 
Unable to forward put for 
{<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to 
'[email protected]' - nodedown
2012-05-08 08:21:11.890 [error] 
<0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition 
riak_kv_vnode 456719261665907161938651510223838443642478919680 from 
'[email protected]' to '[email protected]' 
failed 
exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'[email protected]'},handoff_port,infinity]}}
2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199 
Unable to forward put for 
{<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to 
'[email protected]' - timeout
2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199 
Unable to forward put for 
{<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to 
'[email protected]' - timeout
...
2012-05-08 08:30:26.379 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.4921.2570> 
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
 {#Port<0.35446433>,'[email protected]'}
2012-05-08 08:30:26.556 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
{suppressed,port_events,7}
2012-05-08 08:30:26.616 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.4930.2570> 
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
 {#Port<0.35446433>,'[email protected]'}
2012-05-08 08:30:27.565 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
{suppressed,port_events,4}
2012-05-08 08:30:27.668 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.3151.2570> 
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
 {#Port<0.35446433>,'[email protected]'}
...
2012-05-08 10:20:30.088 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.31200.2576> 
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:31.261 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.31202.2576> 
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:32.736 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
<0.1629.2570> 
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:33.552 [info] 
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
{suppressed,port_events,3}




The logs are now almost entirely filled with entries like "monitor 
busy_dist_port <0.22647.2610> 
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
 {#Port<0.35563927>,'[email protected]'}".
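
To put a number on "almost entirely", something like the following rough
sketch could tally the entries by level and source. It assumes the on-disk log
keeps each entry on a single line (the excerpts above are wrapped by the mail
client) and that every entry starts with the date, time, level and
pid@module:function:line prefix seen above. The function name tally and the
sample lines are made up for illustration.

```python
import re
from collections import Counter

# Start of a log entry as it appears in the excerpts above:
# "<date> <time> [<level>] <pid>@<module>:<function>:<line> <message>"
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ "
    r"\[(?P<level>\w+)\] "
    r"<[\d.]+>@(?P<source>\w+:\w+:\d+) "
    r"(?P<message>.*)$"
)


def tally(lines):
    """Return (Counter keyed by (level, source), number of busy_dist_port entries)."""
    by_source = Counter()
    busy_dist_port = 0
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation line or a format this sketch does not know
        by_source[(match["level"], match["source"])] += 1
        if "busy_dist_port" in match["message"]:
            busy_dist_port += 1
    return by_source, busy_dist_port


# Hypothetical sample lines, shortened from the excerpts above.
sample = [
    "2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 Unable to forward put",
    "2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4921.2570>",
    "2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,7}",
]
counts, busy = tally(sample)
for (level, source), n in counts.most_common():
    print(n, level, source)
print("busy_dist_port entries:", busy)
```

Passing an open file handle for the real log instead of the sample list should
work the same way, since the function only iterates over lines.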

riak-admin is unable to report any information about the cluster, and neither
is Riak Control.
Both just time out and return:

production-vpc east-riak-002 riak $ riak-admin ring_status
Attempting to restart script through sudo -u riak
RPC to '[email protected]' failed: {'EXIT',
                                                     {timeout,
                                                      {gen_server,call,
                                                       [riak_core_gossip,
                                                        legacy_gossip]}}}


At this point, as far as I can tell, the cluster has either stopped responding
to requests entirely, or the operations that do complete take well over 60
seconds for a single put with w=1.

Has anybody else seen this and, if so, is there any advice for getting it
resolved?

Best Regards,

Armon Dadgar

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
