Hey,

The cluster is back up and running after going through the following steps:
  1) Force terminate east-riak-001 using the AWS console
  2) "riak-admin down [email protected]" on ALL nodes
  3) "riak stop && riak start" on ALL nodes
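
Steps 2 and 3 could be scripted roughly as below. This is only a sketch:
the host names in NODES, the node name in DOWN_NODE, and the use of ssh are
hypothetical placeholders rather than the real setup, and the script only
prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: prints the commands for steps 2 and 3 instead of running them.
# NODES and DOWN_NODE are hypothetical placeholders, not the real node names.
NODES="${NODES:-riak-002 riak-003 riak-004}"
DOWN_NODE="${DOWN_NODE:-riak@riak-001.example}"

recovery_plan() {
    # Step 2: mark the terminated node as down on every remaining node.
    for host in $NODES; do
        echo "ssh $host riak-admin down $DOWN_NODE"
    done
    # Step 3: restart Riak on every remaining node.
    for host in $NODES; do
        echo "ssh $host 'riak stop && riak start'"
    done
}

recovery_plan
```

Printing the plan first keeps the destructive part manual; the output can be
piped to sh once it looks right.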

All the nodes appear to have been blocked trying to talk to riak 001, which was
the ring claimant at the time. Doing this seems to have cleared enough state for
the cluster to make progress again.
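
One way to check which peer the nodes were blocked on is to tally the
busy_dist_port log lines by the remote node they name. The function below is
a rough sketch, assuming each event sits on a single line in the log file and
ends with {#Port<...>,'node@host'} as in the excerpts quoted further down; the
function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: count "monitor busy_dist_port" events per remote node in a Riak log.
# Assumes one event per line, ending with {#Port<...>,'node@host'}.
busy_dist_port_peers() {
    grep 'monitor busy_dist_port' "$1" \
        | sed -n "s/.*{#Port<[^>]*>,'\([^']*\)'}.*/\1/p" \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}
```

It would be called with the path to a node's console log, for example
busy_dist_port_peers /var/log/riak/console.log (that path is an assumption
and may differ per install). The node with the highest count is listed first.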

Regarding the other questions:
  * Backend: LevelDB
  * OS: Ubuntu 10.04
  * Value size: 500 bytes to 1 KB
  * Traffic: 300 ops/sec

I will send the vm.args file offline too.

Best Regards,

Armon Dadgar


On Tuesday, May 8, 2012 at 11:12 AM, Mark Phillips wrote:

> Hey Armon, 
> 
> So "monitor busy_dist_port" means your nodes aren't talking, but we need to
> figure out why. Specifically, it looks like your kv vnodes aren't able to
> communicate.
> 
> First, a few questions:
> 
> * Which backend are you using? 
> * What OS?
> * What size are your values?
> * What is the typical traffic (ops/second) on the cluster?
> 
> Also, if you could send a copy of your vm.args (probably best off-list) that 
> would be helpful, too. 
> 
> Mark 
> 
> On Tue, May 8, 2012 at 10:25 AM, Armon Dadgar <[email protected]> wrote:
> > We are currently running a 4 node cluster with 1.1.2 on Ubuntu 10.04, and
> > are experiencing an issue where losing a single node has caused the entire
> > cluster to fail.
> > 
> > Nagios reported that node 1 had failed. Shortly after, all the logs
> > filled with:
> > 2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 
> > Unable to forward put for 
> > {<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to 
> > '[email protected]' - nodedown
> > 2012-05-08 08:21:11.890 [error] 
> > <0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition 
> > riak_kv_vnode 456719261665907161938651510223838443642478919680 from 
> > '[email protected]' to
> > '[email protected]' failed
> > exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'[email protected]'},handoff_port,infinity]}}
> > 2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199 
> > Unable to forward put for 
> > {<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to 
> > '[email protected]' - timeout
> > 2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199 
> > Unable to forward put for 
> > {<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to 
> > '[email protected]' - timeout
> > ...
> > 2012-05-08 08:30:26.379 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.4921.2570> 
> > [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35446433>,'[email protected]'}
> > 2012-05-08 08:30:26.556 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
> > {suppressed,port_events,7}
> > 2012-05-08 08:30:26.616 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.4930.2570> 
> > [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35446433>,'[email protected]'}
> > 2012-05-08 08:30:27.565 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
> > {suppressed,port_events,4}
> > 2012-05-08 08:30:27.668 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.3151.2570> 
> > [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35446433>,'[email protected]'}
> > ...
> > 2012-05-08 10:20:30.088 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.31200.2576>
> > [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35534018>,'[email protected]'}
> > 2012-05-08 10:20:31.261 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.31202.2576>
> > [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35534018>,'[email protected]'}
> > 2012-05-08 10:20:32.736 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port 
> > <0.1629.2570>
> > [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35534018>,'[email protected]'}
> > 2012-05-08 10:20:33.552 [info] 
> > <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got 
> > {suppressed,port_events,3}
> > 
> > 
> > 
> > 
> > Now the logs are almost entirely filled with "monitor
> > busy_dist_port <0.22647.2610>
> > [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}]
> >  {#Port<0.35563927>,'[email protected]'}" or similar.
> > 
> > riak-admin is unable to report any information about the cluster, and the
> > same goes for Riak Control.
> > Both just time out and return:
> > 
> > production-vpc east-riak-002 riak $ riak-admin ring_status 
> > Attempting to restart script through sudo -u riak
> > RPC to '[email protected]' failed: {'EXIT',
> >                                                      {timeout,
> >                                                       {gen_server,call,
> >                                                        [riak_core_gossip,
> >                                                         legacy_gossip]}}}
> > 
> > 
> > At this point, as far as I can tell, the cluster has stopped responding to
> > requests; the operations that do complete take well over 60 seconds for a
> > single put with w=1.
> > 
> > Wondering if anybody else has seen this, and if so, any advice for getting
> > it resolved?
> > 
> > Best Regards,
> > 
> > Armon Dadgar
> > 
> > 
> > _______________________________________________
> > riak-users mailing list
> > [email protected]
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > 
> 

