Re: Very (very) slow handoff, how to investigate?

Ian Plosker Thu, 26 Jan 2012 13:34:57 -0800

Gal, 

The limiting factor on EC2 will likely be IOPS (i.e., disk throughput). EC2 is 
an IOPS-constrained environment, especially if you're using EBS. Further, a 
leave can induce a large number of ownership changes to ensure that preflists 
maintain the appropriate n_vals. The number of partitions that need to be 
shuffled can exceed 80% of all partitions. In short, it can take a while for 
the rebalance to complete. Assuming you're using a >=1.0 release, your 
cluster should still respond correctly to all incoming requests.


Which version of Riak are you using? As of Riak 1.0.3, `handoff_concurrency`, 
the number of outgoing handoffs per node, is set to 1. This slows the 
rebalance, but it also reduces the impact of the rebalance on your cluster. 
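
For a rough sense of scale, the figures in the handoff log line quoted below 
can be turned into a throughput estimate. This is only a back-of-the-envelope 
sketch in Python: the function names are made up for illustration, and the 
1 KB average object size is taken from the "~1k each" in the original post, 
not measured.

```python
def handoff_rate(objects_sent, seconds, avg_object_bytes=1024):
    """Return (objects per second, MB per second) for one completed handoff.

    avg_object_bytes is an assumption (~1 KB per record, per the original post).
    """
    objs_per_sec = objects_sent / seconds
    mb_per_sec = objs_per_sec * avg_object_bytes / 1e6
    return objs_per_sec, mb_per_sec


def hours_to_transfer(total_objects, objs_per_sec, concurrency=1):
    """Estimate hours to move total_objects at a steady per-handoff rate.

    concurrency models the number of simultaneous outgoing handoffs; it
    assumes the rate scales linearly, which an IOPS-bound disk may not allow.
    """
    return total_objects / (objs_per_sec * concurrency) / 3600


# Figures from the log line: "sent 5110665 objects in 3905.97 seconds"
rate, mb = handoff_rate(5110665, 3905.97)
print(f"{rate:.0f} objects/s, ~{mb:.2f} MB/s")
```

At roughly 1 KB per object this works out to a little over 1 MB/s for a single 
handoff, so with one outgoing handoff per node a full rebalance of many 
partitions can plausibly take hours.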

-- 
Ian Plosker <[email protected]>
Developer Advocate
Basho Technologies, Inc.



On Thursday, January 26, 2012 at 3:43 PM, Gal Barnea wrote:

> Ok, so now I can see in the "leaving" node logs:
> 2012-01-26 19:18:23.015 [info] 
> <0.32148.2873>@riak_core_handoff_sender:start_fold:39 Starting handoff of 
> partition riak_kv_vnode 685078892498860742907977265335757665463718379520 from 
> '[email protected]' to '[email protected]'
> 2012-01-26 19:24:17.798 [info] <0.31620.2873> alarm_handler: 
> {set,{system_memory_high_watermark,[]}}
> 2012-01-26 20:23:28.991 [info] 
> <0.32148.2873>@riak_core_handoff_sender:start_fold:87 Handoff of partition 
> riak_kv_vnode 685078892498860742907977265335757665463718379520 from 
> '[email protected]' to '[email protected]' completed: sent 5110665 
> objects in 3905.97 seconds
> 
> So things *are* moving, but at a rate of about 1308 records per second.
> This sounds very slow to me, given the small record size, the high 
> bandwidth inside EC2, and practically 0% load on the servers.
> 
> any thoughts? 
> 
> 
> 
> On Thu, Jan 26, 2012 at 10:12 PM, Gal Barnea <[email protected]> wrote:
> > Hi all
> > 
> > I have a 6 server cluster running on ec2 (m1.large) - this is an evaluation 
> > environment, so practically no load besides the existing data (~200 million 
> > records, ~1k each) 
> > 
> > After running "riak-admin leave" on one of the nodes, I noticed that for 
> > more than 3 hours:
> > 1 - member_status showed one "leaving" node and pending data to hand off 
> > on the rest, but the numbers never changed
> > 2 - riak-admin transfers showed handoffs waiting, but nothing changed
> > 
> > at this point, I restarted the "leaving" node, so now the status is 
> > 1 - member_status - still stuck with the same numbers
> > 2 - transfers - are slowly changing
> > 
> > The leaving server's logs show that a single handoff started after 
> > the restart, but nothing since (roughly an hour ago)
> > 
> > Interestingly, the leaving server is pretty idle while the remaining 
> > servers are working hard at 50%-60% cpu 
> > 
> > so, the question now is where should I dig around to try and understand 
> > what's going on. Any thoughts? 
> > 
> > Thanks
> > Gal
> > 
> >  
> > 
> 
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
>

