We are currently running a four-node Riak 1.1.2 cluster on Ubuntu 10.04, and
are experiencing an issue where losing a single node has caused the entire
cluster to fail.
Nagios reported that node 1 had failed. Shortly afterwards, the logs on the
remaining nodes filled up with:
2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199
Unable to forward put for
{<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to
'[email protected]' - nodedown
2012-05-08 08:21:11.890 [error]
<0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition
riak_kv_vnode 456719261665907161938651510223838443642478919680 from
'[email protected]' to '[email protected]'
failed
exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'[email protected]'},handoff_port,infinity]}}
2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199
Unable to forward put for
{<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to
'[email protected]' - timeout
2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199
Unable to forward put for
{<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to
'[email protected]' - timeout
...
2012-05-08 08:30:26.379 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.4921.2570>
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35446433>,'[email protected]'}
2012-05-08 08:30:26.556 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
{suppressed,port_events,7}
2012-05-08 08:30:26.616 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.4930.2570>
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35446433>,'[email protected]'}
2012-05-08 08:30:27.565 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
{suppressed,port_events,4}
2012-05-08 08:30:27.668 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.3151.2570>
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35446433>,'[email protected]'}
...
2012-05-08 10:20:30.088 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.31200.2576>
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:31.261 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.31202.2576>
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:32.736 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port
<0.1629.2570>
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35534018>,'[email protected]'}
2012-05-08 10:20:33.552 [info]
<0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
{suppressed,port_events,3}
The logs on every node now consist almost entirely of entries like "monitor
busy_dist_port <0.22647.2610>
[{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
{#Port<0.35563927>,'[email protected]'}".
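To put a number on how noisy the logs are, something like the following could tally entries by level and module, and count busy_dist_port events per minute. This is only a rough sketch: it assumes each entry sits on a single line in the actual log file (the excerpts above were wrapped when pasted), and the names `ENTRY`, `summarize`, and `SAMPLE` are made up for illustration. The sample lines are abbreviated copies of the entries quoted above.

```python
import re
from collections import Counter

# Matches the start of a log entry in the format seen in the excerpts, e.g.
# "2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor ..."
ENTRY = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} "
    r"\[(?P<level>\w+)\] "
    r"<[\d.]+>@(?P<module>\w+):(?P<function>\w+):\d+ "
    r"(?P<message>.*)$"
)


def summarize(lines):
    """Return (entries per (level, module), busy_dist_port events per minute)."""
    by_source = Counter()
    busy_per_minute = Counter()
    for line in lines:
        m = ENTRY.match(line.rstrip("\n"))
        if m is None:
            continue  # continuation line or unrecognized format; skip it
        by_source[(m["level"], m["module"])] += 1
        if "busy_dist_port" in m["message"]:
            busy_per_minute[m["minute"]] += 1
    return by_source, busy_per_minute


# Abbreviated sample entries, each joined onto one line (assumption: this is
# how they appear in the real log file).
SAMPLE = [
    "2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 Unable to forward put",
    "2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4921.2570>",
    "2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,7}",
    "2012-05-08 08:30:26.616 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4930.2570>",
]

by_source, busy_per_minute = summarize(SAMPLE)
for (level, module), count in by_source.most_common():
    print(level, module, count)
for minute, count in sorted(busy_per_minute.items()):
    print(minute, "busy_dist_port events:", count)
```

Passing an open file handle instead of `SAMPLE` should work the same way, since `summarize` only iterates over lines.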
riak-admin is unable to report any information about the cluster, and neither
is Riak Control.
Both simply time out and return:
production-vpc east-riak-002 riak $ riak-admin ring_status
Attempting to restart script through sudo -u riak
RPC to '[email protected]' failed: {'EXIT',
{timeout,
{gen_server,call,
[riak_core_gossip,
legacy_gossip]}}}
At this point, as far as I can tell, the cluster has stopped responding to
requests. The few operations that do complete take well over 60 seconds for a
single put with w=1.
Has anybody else seen this, and if so, is there any advice for getting it
resolved?
Best Regards,
Armon Dadgar