Thanks Mark...

On Sat, Mar 9, 2013 at 1:29 AM, Mark Phillips <[email protected]> wrote:

> Hi Chris,
>
> Thanks for the detailed write up. These are some great data points.
>
> We're doing some work right now to make large rings (where "large" =
> more than 512 partitions) more efficient in terms of start and
> convergence time, and handoff.
>
>
Good to hear.


> First things first: since your test cluster has no data in it, adding
> "forced_ownership_handoff" to your riak_core section of your
> app.config and up'ing it to something higher than your ring size
> should help hasten convergence. *This is only useful for the purposes
> of testing and should not be done in production.* That would look like
> this:
>
> {forced_ownership_handoff, 512}
>
>
I'm doing this to understand current production problems. If it should not
be done in production then I'm not interested :D


> You could also increase the "max_concurrency" setting (which also has
> to be added to the riak_core section in your app.config). This
> defaults to "2". You could also look at lowering the
> "vnode_management_timer" from "10000" (10 seconds by default).
>
>
Can you point me at more documentation for "max_concurrency"? I've looked
at the code, and it's clear what "vnode_management_timer" does, but my
Erlang-fu is not good enough to be certain of the impact of
"max_concurrency". How does it interact with "handoff_concurrency" (if at
all), which we currently have at 8?


> Back to the current limitations of Riak..
>
> A few members of the Basho eng team - primarily Joe Blomstedt - have
> been hacking on the ring-related code for the last week or so and are
> making some great progress. The improvements will be in the 1.4
> release (though it is a few months out from being official). To quote
> Joe from an internal email: "in my current work-in-progress branch, I
> successfully joined 4-nodes together using a 16384 ring yesterday.
> Still took about 20 min, but working on bringing that down even
> further today. Also, impact to cluster performance is worlds
> different."
>
> So, we're well aware of the improvements that need to be made in this
> area and are working quickly to improve. I think Joe has plans to
> share his working code with the list in the near future (via a GitHub
> PR/Issue I suspect), so look out for that.
>
> In the interim, I would stick with a ring size of 512 or less for
> production clusters if you're not already live, and lean on some
> beefier hardware to mitigate the current inefficiencies with large
> rings until the code is purified.
>
>
We're already live with a ring size of 512 and pretty close to 100TB of
data in there, and we're already seeing handoff problems, which is why I'm
investigating this further. As for beefiness of hardware, we're currently
running dual 6 core Intel X5675 machines with 96G RAM and 10G Ethernet
links between nodes. We're not going to get beefier any time soon.

We're currently growing the cluster from the 10 nodes we started with to 20
to cope with the load, but it takes almost 3 days for handoff to a new node
to happen. We're doing it one by one instead of as a bulk add operation
because with almost 200G per partition the window of "not found" errors
between a new node taking the partition and completing transfer to serve
the data is already uncomfortably high.
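
As a rough sanity check on the "almost 200G per partition" figure, here is a
minimal Python sketch. It assumes decimal units (1TB = 10^12 bytes), that the
100TB is the total stored across the cluster, and that data is spread evenly
over the ring. The function name is made up for illustration and is not part
of Riak.

```python
TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)


def data_per_partition(total_bytes: int, ring_size: int) -> float:
    """Average bytes per partition, assuming data is spread evenly over the ring."""
    if ring_size <= 0:
        raise ValueError("ring_size must be positive")
    return total_bytes / ring_size


# Roughly 100TB in a 512-partition ring.
current = data_per_partition(100 * TB, 512)
print(f"{current / GB:.0f} GB per partition")  # prints "195 GB per partition"
```

Real partitions will not be perfectly even, so treat this as an average only.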



> Let us know if you have any other questions. Thanks for your testing
> and patience.
>

One more question I have is around changing the ring size on a running
cluster. Is it something that you're working on? While we're at less than
about 150TB of data in our cluster we can probably find a way to build a
second cluster and transfer the data over, but we expect to reach the stage
where we'll have over 500TB of data in riak, and at that stage we won't be
able to build a second cluster and don't want to be stuck with almost 1TB
of data per partition...
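
To put numbers on that, here is a small Python sketch of the sizing formula
from my first message (quoted below), plus the average data per partition at
500TB for a few ring sizes. It assumes decimal units and an even spread of
data; the function name and the default of 10 partitions per node are taken
from the formula as I wrote it, not from any Riak API.

```python
TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)


def recommended_ring_size(max_nodes: int, min_partitions_per_node: int = 10) -> int:
    """Smallest power of two >= max_nodes * min_partitions_per_node.

    Equivalent to 2 ^ ceiling(log2(max_nodes * min_partitions_per_node)),
    computed with integer arithmetic to avoid floating point rounding.
    """
    needed = max_nodes * min_partitions_per_node
    if needed <= 0:
        raise ValueError("max_nodes and min_partitions_per_node must be positive")
    return 1 << (needed - 1).bit_length()


print(recommended_ring_size(20))  # 20 nodes * 10 partitions = 200 -> 256

# Average data per partition at 500TB, for a few ring sizes.
for ring in (512, 1024, 2048):
    print(ring, f"{500 * TB / ring / GB:.0f} GB")
```

Note that the formula only looks at node count. It says nothing about data
volume, which is why a 512 ring can satisfy it and still leave close to 1TB
in each partition at 500TB.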

Thanks,

Chris


>
> Mark
>
>
> On Fri, Mar 8, 2013 at 6:37 AM, Chris Read <[email protected]> wrote:
> > Greetings all...
> >
> > While I can find lots of documentation about what a ring is and how it's
> > used in Riak, I've found very little that's actually useful about
> > determining the right size for your system. The most useful formula I've
> > found so far has been the simple:
> >
> > ring size = 2 ^ (ceiling(log(max nodes * min partitions per node, 2)))
> >
> > Where the minimum recommended number of partitions per node is 10 (as per
> > http://docs.basho.com/riak/latest/cookbooks/faqs/operations-faq/#is-it-possible-to-change-the-number-of-partitions
> > ).
> >
> > Nothing tells me though what a sane upper bound is for the amount of data
> > in a partition, or the overhead inside the cluster of managing larger ring
> > sizes. My gut feel though is that more than a couple of hundred gigabytes
> > per partition is getting a bit much.
> >
> > I've done some initial testing of ring sizes across a cluster of 9
> > physical machines and have seen some concerning results. All the numbers
> > below are done on the same hardware running Ubuntu 12.04 with Riak 1.3.0
> > (official .deb release):
> >
> > Ring Size      |     512 |    1024 |    2048 |
> > Create Cluster | 0:01:53 | 0:05:41 | 0:12:58 |
> > Remove Node    | 0:04:01 | 0:10:31 | 0:31:13 |
> > Add Node       | 0:01:05 | 0:05:22 | 1:04:49 |
> >
> > All this is done with NO DATA in the cluster at all - so why does it take
> > over an hour to add a new node when ring=2048?
> >
> > Does it have anything to do with the concerns raised on this thread:
> >
> > https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/DZkgkgd9YnA
> >
> > Thanks,
> >
> > Chris
> >
> >
> >
> >
> > _______________________________________________
> > riak-users mailing list
> > [email protected]
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
>