Gal,

You could try using `riak attach` and running the following to increase the
handoff_concurrency from 1 to 4:

    application:set_env(riak_core, handoff_concurrency, 4).

You will need to do this on all nodes. This will only remain in effect as long
as the nodes remain running. If you wish to permanently increase the handoff
concurrency you will have to do so in the app.config.
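For reference, a rough sketch of what the permanent setting might look like in
app.config, assuming the standard Erlang-terms layout with a riak_core section
(the surrounding entries below are just placeholders for whatever is already in
your file):

    [
     {riak_core, [
         %% allow up to 4 concurrent outbound handoffs from this node
         {handoff_concurrency, 4}
         %% ...existing riak_core settings go here too (comma-separated)...
     ]}
     %% ...other application sections (riak_kv, etc.) stay unchanged...
    ].

Note that app.config is only read at node start, whereas the set_env call above
takes effect immediately on a running node.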
--
Ian Plosker <[email protected]>
Developer Advocate
Basho Technologies

On Friday, January 27, 2012 at 7:45 AM, Gal Barnea wrote:

> Hi Ian
>
> Thanks for the informative answer, I am using 1.0.3 indeed.
>
> A day later, the cluster is making progress, but then I saw this in the
> console.log:
>
> 2012-01-27 08:51:32.643 [info]
> <0.30733.2881>@riak_core_handoff_sender:start_fold:87 Handoff of partition
> riak_kv_vnode 502391187832497878132516661246222288006726811648 from
> '[email protected]' to
> '[email protected]' completed: sent 5100479
> objects in 10596.49 seconds
>
> so we've dropped 50% in rate and are now at less than 500 records/second!
>
> Frankly, I think this is problematic any way you look at it... If I need to
> wait days every time I manually remove a server from the cluster, it isn't
> really a valid solution from my perspective.
>
> Any thoughts?
>
> Regards
> Gal
>
>
> On Thu, Jan 26, 2012 at 11:35 PM, Ian Plosker <[email protected]> wrote:
> > Gal,
> >
> > The limiting factor on EC2 will likely be IOPS (i.e. disk throughput). EC2
> > is an IOPS-constrained environment, especially if you're using EBS.
> > Further, doing a leave can induce a large number of ownership changes to
> > ensure that preflists maintain the appropriate n_vals. The number of
> > partitions that need to be shuffled can exceed 80% of all partitions. In
> > short, it can take a while for the rebalance to complete. Assuming you're
> > using a >=1.0 release, your cluster should still correctly respond to all
> > incoming requests.
> >
> > Which version of Riak are you using? As of Riak 1.0.3,
> > `handoff_concurrency`, the number of outgoing handoffs per node, is set
> > to 1. This reduces the rate at which the rebalance occurs, but it also
> > reduces the impact of the rebalance on your cluster.
> >
> > --
> > Ian Plosker <[email protected]>
> > Developer Advocate
> > Basho Technologies, Inc.
> >
> >
> > On Thursday, January 26, 2012 at 3:43 PM, Gal Barnea wrote:
> >
> > > Ok, so now I can see in the "leaving" node logs:
> > >
> > > 2012-01-26 19:18:23.015 [info]
> > > <0.32148.2873>@riak_core_handoff_sender:start_fold:39 Starting handoff
> > > of partition riak_kv_vnode
> > > 685078892498860742907977265335757665463718379520
> > > from '[email protected]' to
> > > '[email protected]'
> > > 2012-01-26 19:24:17.798 [info] <0.31620.2873>
> > > alarm_handler: {set,{system_memory_high_watermark,[]}}
> > > 2012-01-26 20:23:28.991 [info]
> > > <0.32148.2873>@riak_core_handoff_sender:start_fold:87 Handoff of
> > > partition riak_kv_vnode
> > > 685078892498860742907977265335757665463718379520
> > > from '[email protected]' to
> > > '[email protected]' completed: sent
> > > 5110665 objects in 3905.97 seconds
> > >
> > > so things *are* moving, but at a rate of 1308 records per second.
> > > This sounds very slow to me, accounting for the small record size, the
> > > high bandwidth available inside EC2, and the practically 0% load on the
> > > servers.
> > >
> > > Any thoughts?
> > >
> > >
> > > On Thu, Jan 26, 2012 at 10:12 PM, Gal Barnea <[email protected]> wrote:
> > > > Hi all
> > > >
> > > > I have a 6-server cluster running on EC2 (m1.large) - this is an
> > > > evaluation environment, so there is practically no load besides the
> > > > existing data (~200 million records, ~1k each).
> > > >
> > > > After running "riak-admin leave" on one of the nodes, I noticed that
> > > > for more than 3 hours:
> > > > 1 - member_status showed that there is one "leaving" node and pending
> > > > data to hand off on the rest, but the numbers never changed
> > > > 2 - riak-admin transfers showed handoffs waiting, but nothing changed
> > > >
> > > > At this point, I restarted the "leaving" node, so now the status is:
> > > > 1 - member_status - still stuck with the same numbers
> > > > 2 - transfers - are slowly changing
> > > >
> > > > The leaving server's logs show that a single handoff started after
> > > > the restart, but nothing since (roughly an hour ago).
> > > >
> > > > Interestingly, the leaving server is pretty idle while the remaining
> > > > servers are working hard at 50%-60% CPU.
> > > >
> > > > So, the question now is where I should dig around to try and
> > > > understand what's going on. Any thoughts?
> > > >
> > > > Thanks
> > > > Gal
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
