Hi Ian,

Thanks for the informative answer; I am using 1.0.3 indeed.
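Since I'm on 1.0.3, I assume the default of 1 applies on my nodes. If I want to confirm that on a live node, I'm guessing the setting can be read from `riak attach` with the standard OTP call (just a sketch; I haven't verified where 1.0.3 actually keeps it):

    %% from `riak attach` on one of the nodes
    %% assumes handoff_concurrency lives in the riak_core application environment
    application:get_env(riak_core, handoff_concurrency).
    %% expecting {ok,1} on a stock 1.0.3 node, or 'undefined' if the
    %% default is compiled in rather than set via app.config

Please correct me if 1.0.3 exposes this somewhere else.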
A day later the cluster is making progress, but then I saw this in console.log:

2012-01-27 08:51:32.643 [info] <0.30733.2881>@riak_core_handoff_sender:start_fold:87 Handoff of partition riak_kv_vnode 502391187832497878132516661246222288006726811648 from '[email protected]' to '[email protected]' completed: sent 5100479 objects in 10596.49 seconds

So the rate has dropped by more than half and is now below 500 records/second! Frankly, I think this is problematic any way you look at it... If I need to wait days every time I manually remove a server from the cluster, it isn't really a valid solution from my perspective. Any thoughts? (I've also put a sketch of the config change I think Ian is describing below the quoted thread; corrections welcome.)

Regards,
Gal

On Thu, Jan 26, 2012 at 11:35 PM, Ian Plosker <[email protected]> wrote:

> Gal,
>
> The limiting factor on EC2 will likely be IOPs (i.e., disk throughput). EC2
> is an IOPs-constrained environment, especially if you're using EBS. Further,
> doing a leave can induce a large number of ownership changes to ensure that
> preflists maintain the appropriate n_vals. The number of partitions that
> need to be shuffled can exceed 80% of all partitions. In short, it can take
> a while for the rebalance to complete. Assuming you're using a >=1.0
> release, your cluster should still correctly respond to all incoming
> requests.
>
> Which version of Riak are you using? As of Riak 1.0.3,
> `handoff_concurrency`, the number of outgoing handoffs per node, is set to
> 1. This will reduce the rate at which the rebalance occurs, but it reduces
> the impact of the rebalance on your cluster.
>
> --
> Ian Plosker <[email protected]>
> Developer Advocate
> Basho Technologies, Inc.
>
> On Thursday, January 26, 2012 at 3:43 PM, Gal Barnea wrote:
>
> Ok, so now I can see in the "leaving" node logs:
>
> 2012-01-26 19:18:23.015 [info] <0.32148.2873>@riak_core_handoff_sender:start_fold:39 Starting handoff of partition riak_kv_vnode 685078892498860742907977265335757665463718379520 from '[email protected]' to '[email protected]'
> 2012-01-26 19:24:17.798 [info] <0.31620.2873> alarm_handler: {set,{system_memory_high_watermark,[]}}
> 2012-01-26 20:23:28.991 [info] <0.32148.2873>@riak_core_handoff_sender:start_fold:87 Handoff of partition riak_kv_vnode 685078892498860742907977265335757665463718379520 from '[email protected]' to '[email protected]' completed: sent 5110665 objects in 3905.97 seconds
>
> So things *are* moving, but at a rate of about 1308 records per second.
> This sounds very slow to me, given the small record size, the high
> bandwidth inside EC2, and practically 0% load on the servers.
>
> Any thoughts?
>
> On Thu, Jan 26, 2012 at 10:12 PM, Gal Barnea <[email protected]> wrote:
>
> Hi all
>
> I have a 6 server cluster running on EC2 (m1.large) - this is an
> evaluation environment, so there is practically no load besides the
> existing data (~200 million records, ~1k each).
>
> After running "riak-admin leave" on one of the nodes, I noticed that for
> more than 3 hours:
> 1 - member_status showed one "leaving" node and pending data to hand off
> on the rest, but the numbers never changed
> 2 - riak-admin transfers showed handoffs waiting, but nothing changed
>
> At this point I restarted the "leaving" node, so now the status is:
> 1 - member_status - still stuck with the same numbers
> 2 - transfers - slowly changing
>
> The leaving server's logs show that a single handoff started after the
> restart, but nothing since (roughly an hour ago).
>
> Interestingly, the leaving server is pretty idle while the remaining
> servers are working hard at 50%-60% CPU.
>
> So the question now is where I should dig around to try and understand
> what's going on. Any thoughts?
>
> Thanks
> Gal
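As mentioned above, here is the change I think Ian is describing, sketched under a couple of assumptions: that handoff_concurrency belongs in the riak_core section of etc/app.config on each node, and that a node restart is needed for it to take effect. The value 4 is purely illustrative, not a recommendation:

    %% etc/app.config fragment (other settings unchanged)
    {riak_core, [
        %% maximum number of concurrent outgoing handoffs per node;
        %% Ian notes the 1.0.3 default is 1
        {handoff_concurrency, 4}
    ]}

If that's roughly right, I'd expect riak-admin transfers to show more partitions in flight at once after the change, at the cost of heavier disk load during the rebalance.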
