Guys, thanks for the tips!! Helpful indeed.
-Matthew

On Tue, Jan 31, 2012 at 2:32 AM, Gal Barnea <[email protected]> wrote:
> Guys,
> Thanks a lot for the helpful pointers.
>
> I decided to focus on speeding up the process of joining servers to the
> cluster, where it is easier to monitor disk space during the handoff
> ("watch df -B M" and "dstat -dn -D") and deduce the actual handoff
> progress. (When partitions are big enough, there is very little indication
> of their handoff progress in Riak's logs.)
>
> Increasing handoff_concurrency did indeed help push the handoff rate
> much higher.
> Also, running on EC2, I was able to set up RAID0 on multiple ephemeral
> drives, which helped me reach rates of around 35-40 MB/s - practically the
> IO limit set by Amazon.
>
> Thanks guys!
>
> On Fri, Jan 27, 2012 at 9:26 PM, Joseph Blomstedt <[email protected]> wrote:
>> Gal,
>>
>> 0.5 to 1 MB/s is indeed painfully slow.
>>
>> A few questions:
>> Which backend are you running: Bitcask, LevelDB, etc.?
>> Are you using the local ephemeral storage, or running off EBS?
>> Are you running any software RAID?
>> Which filesystem are you running?
>> Which OS are you using?
>> Have you changed any of Riak's default settings?
>>
>> Also, any chance you could provide the output of "iostat -x" during
>> one of these long handoff sessions? Preferably on both the sending and
>> receiving nodes.
>>
>> The more information we have, the better we can try to help out here.
>>
>> Regards,
>> Joe
>>
>> On Fri, Jan 27, 2012 at 7:44 AM, Ian Plosker <[email protected]> wrote:
>>> Gal,
>>>
>>> You could try using `riak attach` and running the following to increase
>>> the handoff_concurrency from 1 to 4:
>>>
>>>     application:set_env(riak_core, handoff_concurrency, 4).
>>>
>>> You will need to do this on all nodes. This will only remain in effect
>>> as long as the nodes remain running. If you wish to permanently increase
>>> the handoff concurrency, you will have to do so in app.config.
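[Editor's note: as a back-of-envelope check of the throughput figures traded in this thread - this sketch is not from the thread itself. It assumes ~1 KB per record (Gal's "~1k each") and ignores protocol and key overhead, so treat the numbers as order-of-magnitude only.]

```python
# Rough sketch: time to stream one ~5.1M-object partition at the rates
# discussed above - the ~1 MB/s Joe calls "painfully slow" versus the
# ~35 MB/s Gal reached after RAID0 on ephemeral disks.
# Assumption: ~1 KB per record, no overhead.

OBJECTS_PER_PARTITION = 5_100_479  # from the console.log line in this thread
RECORD_SIZE_BYTES = 1024           # "~1k each", per Gal's first message

def handoff_hours(throughput_mb_per_s: float) -> float:
    """Rough time to hand off one partition at the given disk throughput."""
    total_bytes = OBJECTS_PER_PARTITION * RECORD_SIZE_BYTES
    seconds = total_bytes / (throughput_mb_per_s * 1024 * 1024)
    return seconds / 3600

print(f"at  1 MB/s: {handoff_hours(1):5.2f} h")   # ~1.4 h per partition
print(f"at 35 MB/s: {handoff_hours(35):5.2f} h")  # minutes per partition
```

At ~1 MB/s each partition takes well over an hour, which is consistent with the multi-hour handoffs reported below; at the RAID0-level throughput the same partition moves in a couple of minutes.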
>>>
>>> --
>>> Ian Plosker <[email protected]>
>>> Developer Advocate
>>> Basho Technologies
>>>
>>> On Friday, January 27, 2012 at 7:45 AM, Gal Barnea wrote:
>>>
>>> Hi Ian,
>>>
>>> Thanks for the informative answer; I am indeed using 1.0.3.
>>>
>>> A day later, the cluster is making progress, but then I saw this in the
>>> console.log:
>>>
>>> 2012-01-27 08:51:32.643 [info]
>>> <0.30733.2881>@riak_core_handoff_sender:start_fold:87 Handoff of partition
>>> riak_kv_vnode 502391187832497878132516661246222288006726811648 from
>>> '[email protected]' to
>>> '[email protected]' completed: sent 5100479 objects
>>> in 10596.49 seconds
>>>
>>> So we've dropped 50% in rate and are now under 500 records/second!
>>>
>>> Frankly, I think this is problematic any way you look at it... If I need
>>> to wait days every time I manually remove a server from the cluster, it
>>> isn't really a valid solution from my perspective.
>>>
>>> Any thoughts?
>>>
>>> Regards,
>>> Gal
>>>
>>> On Thu, Jan 26, 2012 at 11:35 PM, Ian Plosker <[email protected]> wrote:
>>>
>>> Gal,
>>>
>>> The limiting factor on EC2 will likely be IOPS (i.e. disk throughput).
>>> EC2 is an IOPS-constrained environment, especially if you're using EBS.
>>> Further, doing a leave can induce a large number of ownership changes to
>>> ensure that preflists maintain the appropriate n_vals. The number of
>>> partitions that need to be shuffled can exceed 80% of all partitions. In
>>> short, it can take a while for the rebalance to complete. Assuming you're
>>> using a >=1.0 release, your cluster should still correctly respond to all
>>> incoming requests.
>>>
>>> Which version of Riak are you using? As of Riak 1.0.3,
>>> `handoff_concurrency`, the number of outgoing handoffs per node, is set
>>> to 1. This will reduce the rate at which the rebalance occurs, but it
>>> reduces the impact of the rebalance on your cluster.
>>>
>>> --
>>> Ian Plosker <[email protected]>
>>> Developer Advocate
>>> Basho Technologies, Inc.
>>>
>>> On Thursday, January 26, 2012 at 3:43 PM, Gal Barnea wrote:
>>>
>>> OK, so now I can see in the "leaving" node's logs:
>>>
>>> 2012-01-26 19:18:23.015 [info]
>>> <0.32148.2873>@riak_core_handoff_sender:start_fold:39 Starting handoff of
>>> partition riak_kv_vnode 685078892498860742907977265335757665463718379520
>>> from '[email protected]' to
>>> '[email protected]'
>>> 2012-01-26 19:24:17.798 [info] <0.31620.2873> alarm_handler:
>>> {set,{system_memory_high_watermark,[]}}
>>> 2012-01-26 20:23:28.991 [info]
>>> <0.32148.2873>@riak_core_handoff_sender:start_fold:87 Handoff of partition
>>> riak_kv_vnode 685078892498860742907977265335757665463718379520 from
>>> '[email protected]' to
>>> '[email protected]' completed: sent 5110665 objects
>>> in 3905.97 seconds
>>>
>>> So things *are* moving, but at a rate of 1308 records per second.
>>> This sounds very slow to me, accounting for the small record size, the
>>> high bandwidth inside EC2, and practically 0% load on the servers.
>>>
>>> Any thoughts?
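[Editor's note: the per-second rates quoted in this thread can be recomputed directly from the two "completed: sent N objects in S seconds" log lines above.]

```python
# Sanity-check the rates Gal quotes, straight from the handoff log lines.

def rate(objects: int, seconds: float) -> float:
    """Objects handed off per second."""
    return objects / seconds

jan26 = rate(5_110_665, 3905.97)   # first completed handoff on the leaving node
jan27 = rate(5_100_479, 10596.49)  # the later, slower handoff

print(f"Jan 26 handoff: {jan26:7.0f} objects/s")  # matches Gal's 1308/s figure
print(f"Jan 27 handoff: {jan27:7.0f} objects/s")  # under 500/s, as reported
```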
>>>
>>> On Thu, Jan 26, 2012 at 10:12 PM, Gal Barnea <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I have a 6-server cluster running on EC2 (m1.large) - this is an
>>> evaluation environment, so there is practically no load besides the
>>> existing data (~200 million records, ~1k each).
>>>
>>> After running "riak-admin leave" on one of the nodes, I noticed that for
>>> more than 3 hours:
>>> 1 - member_status showed one "leaving" node and pending data to handoff
>>> on the rest, but the numbers never changed
>>> 2 - riak-admin transfers showed handoffs waiting, but nothing changed
>>>
>>> At this point, I restarted the "leaving" node, so now the status is:
>>> 1 - member_status - still stuck with the same numbers
>>> 2 - transfers - slowly changing
>>>
>>> The leaving server's logs show that a single handoff started after the
>>> restart, but nothing since (roughly an hour ago).
>>>
>>> Interestingly, the leaving server is pretty idle while the remaining
>>> servers are working hard at 50%-60% CPU.
>>>
>>> So the question now is where I should dig around to try and understand
>>> what's going on. Any thoughts?
>>>
>>> Thanks,
>>> Gal
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>> --
>> Joseph Blomstedt <[email protected]>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://www.basho.com/
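[Editor's note: combining the numbers in this thread explains why Gal saw multi-day timescales. The sketch below assumes ~200M records total, the observed per-handoff rates, Ian's point that a leave can reshuffle over 80% of partitions, and that raising handoff_concurrency parallelizes transfers roughly linearly - an idealization, since disk IO is the shared bottleneck.]

```python
# Rough sketch of why a "riak-admin leave" can take days at these rates.
# Assumptions (labeled, not from any single message): 80% of the ~200M
# records must move, and concurrency scales throughput linearly.

TOTAL_RECORDS = 200_000_000
RESHUFFLE_FRACTION = 0.8  # "can exceed 80% of all partitions", per Ian

def rebalance_days(objects_per_sec: float, concurrency: int = 1) -> float:
    """Idealized wall-clock days to move the reshuffled data."""
    seconds = TOTAL_RECORDS * RESHUFFLE_FRACTION / (objects_per_sec * concurrency)
    return seconds / 86400

print(f"{rebalance_days(481):.1f} days at the slower observed rate")
print(f"{rebalance_days(1308):.1f} days at the faster observed rate")
print(f"{rebalance_days(1308, concurrency=4):.1f} days with handoff_concurrency=4")
```

Even at the faster observed rate a full rebalance takes over a day with the default handoff_concurrency of 1, which is consistent with both Gal's complaint and the improvement he reported after raising the setting and fixing disk throughput.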
