Hi Ian,

Thanks for the informative answer; I am using 1.0.3 indeed.
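Since I'm on 1.0.3, I assume the default of 1 applies on my nodes. If I want to confirm that on a live node, I'm guessing the setting can be read from `riak attach` with the standard OTP call (just a sketch; I haven't verified where 1.0.3 actually keeps it):

    %% from `riak attach` on one of the nodes
    %% assumes handoff_concurrency lives in the riak_core application environment
    application:get_env(riak_core, handoff_concurrency).
    %% expecting {ok,1} on a stock 1.0.3 node, or 'undefined' if the
    %% default is compiled in rather than set via app.config

Please correct me if 1.0.3 exposes this somewhere else.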
A day later the cluster is making progress, but then I saw this in console.log:

2012-01-27 08:51:32.643 [info] <0.30733.2881>@riak_core_handoff_sender:start_fold:87 Handoff of partition riak_kv_vnode 502391187832497878132516661246222288006726811648 from '[email protected]' to '[email protected]' completed: sent 5100479 objects in 10596.49 seconds

So the rate has dropped by more than half and is now below 500 records/second! Frankly, I think this is problematic any way you look at it... If I need to wait days every time I manually remove a server from the cluster, it isn't really a valid solution from my perspective. Any thoughts? (I've also put a sketch of the config change I think Ian is describing below the quoted thread; corrections welcome.)

Regards,
Gal

On Thu, Jan 26, 2012 at 11:35 PM, Ian Plosker <[email protected]> wrote:

> Gal,
>
> The limiting factor on EC2 will likely be IOPs (i.e., disk throughput). EC2
> is an IOPs-constrained environment, especially if you're using EBS. Further,
> doing a leave can induce a large number of ownership changes to ensure that
> preflists maintain the appropriate n_vals. The number of partitions that
> need to be shuffled can exceed 80% of all partitions. In short, it can take
> a while for the rebalance to complete. Assuming you're using a >=1.0
> release, your cluster should still correctly respond to all incoming
> requests.
>
> Which version of Riak are you using? As of Riak 1.0.3,
> `handoff_concurrency`, the number of outgoing handoffs per node, is set to
> 1. This will reduce the rate at which the rebalance occurs, but it reduces
> the impact of the rebalance on your cluster.
>
> --
> Ian Plosker <[email protected]>
> Developer Advocate
> Basho Technologies, Inc.
>
> On Thursday, January 26, 2012 at 3:43 PM, Gal Barnea wrote:
>
> Ok, so now I can see in the "leaving" node logs:
>
> 2012-01-26 19:18:23.015 [info] <0.32148.2873>@riak_core_handoff_sender:start_fold:39 Starting handoff of partition riak_kv_vnode 685078892498860742907977265335757665463718379520 from '[email protected]' to '[email protected]'
> 2012-01-26 19:24:17.798 [info] <0.31620.2873> alarm_handler: {set,{system_memory_high_watermark,[]}}
> 2012-01-26 20:23:28.991 [info] <0.32148.2873>@riak_core_handoff_sender:start_fold:87 Handoff of partition riak_kv_vnode 685078892498860742907977265335757665463718379520 from '[email protected]' to '[email protected]' completed: sent 5110665 objects in 3905.97 seconds
>
> So things *are* moving, but at a rate of about 1308 records per second.
> This sounds very slow to me, given the small record size, the high
> bandwidth inside EC2, and practically 0% load on the servers.
>
> Any thoughts?
>
> On Thu, Jan 26, 2012 at 10:12 PM, Gal Barnea <[email protected]> wrote:
>
> Hi all
>
> I have a 6 server cluster running on EC2 (m1.large) - this is an
> evaluation environment, so there is practically no load besides the
> existing data (~200 million records, ~1k each).
>
> After running "riak-admin leave" on one of the nodes, I noticed that for
> more than 3 hours:
> 1 - member_status showed one "leaving" node and pending data to hand off
> on the rest, but the numbers never changed
> 2 - riak-admin transfers showed handoffs waiting, but nothing changed
>
> At this point I restarted the "leaving" node, so now the status is:
> 1 - member_status - still stuck with the same numbers
> 2 - transfers - slowly changing
>
> The leaving server's logs show that a single handoff started after the
> restart, but nothing since (roughly an hour ago).
>
> Interestingly, the leaving server is pretty idle while the remaining
> servers are working hard at 50%-60% CPU.
>
> So the question now is where I should dig around to try and understand
> what's going on. Any thoughts?
>
> Thanks
> Gal
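As mentioned above, here is the change I think Ian is describing, sketched under a couple of assumptions: that handoff_concurrency belongs in the riak_core section of etc/app.config on each node, and that a node restart is needed for it to take effect. The value 4 is purely illustrative, not a recommendation:

    %% etc/app.config fragment (other settings unchanged)
    {riak_core, [
        %% maximum number of concurrent outgoing handoffs per node;
        %% Ian notes the 1.0.3 default is 1
        {handoff_concurrency, 4}
    ]}

If that's roughly right, I'd expect riak-admin transfers to show more partitions in flight at once after the change, at the cost of heavier disk load during the rebalance.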
