Alex's description roughly matches up with some of our plans to address this issue.
As with almost anything, this comes down to a tradeoff between consistency and availability. In the case of joining nodes, making the join/handoff/ownership-claim process more "atomic" requires a higher degree of consensus from the machines in the cluster. The current process (which is clearly suboptimal) allows nodes to join the ring as long as they can contact one current ring member. A more atomic process would introduce consensus issues that might prevent nodes from joining in partitioned scenarios. A good solution would probably involve some consistency knobs around the join process to deal with a spectrum of failure/partition scenarios. This is something we are acutely aware of, and we are actively pursuing solutions for a near-term release.

- Andy

On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <[email protected]> wrote:
> I'm really loving this thread. Generating great ideas for the way
> things should be... in the future. It seems to me that "the ring
> changes immediately" is actually the problem, as Ryan astutely
> mentions. One way the future could look is:
>
> - a new node comes online
> - introductions are made
> - candidate vnodes are selected for migration (<- insert pixie dust magic here)
> - the number of simultaneous migrations is configurable, fewer for
>   limited interruption or more for quicker completion
> - vnodes are migrated
> - once migration is completed, ownership is claimed
>
> Selecting vnodes for migration is where the unicorn cavalry attacks the
> dragon's den. If done right(er), the algorithm could be swappable to
> optimize for different strategies. Don't ask me how to implement it,
> I'm only a yellow belt in erlang-fu.
>
> Cheers,
> Alexander
>
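A minimal sketch of the flow Alexander outlines above: migrate data first, a configurable number of vnodes at a time, and claim ownership only at the end. Here select_candidates/2, migrate/1, and claim_ownership/1 are purely hypothetical stand-ins for the "pixie dust" step and the handoff machinery; none of this is a riak_core API.

    -module(migrate_then_claim).
    -export([rebalance/3]).

    %% Hypothetical helpers; they do not exist in riak_core.
    select_candidates(_NewNode, _Ring) -> [].
    migrate(Vnode) -> {ok, Vnode}.
    claim_ownership(_Vnode) -> ok.

    %% Hand off data in bounded batches, then (and only then) claim
    %% ownership. Each batch could be spawned in parallel; it is shown
    %% sequentially here for brevity.
    rebalance(NewNode, Ring, MaxSimultaneous) ->
        Candidates = select_candidates(NewNode, Ring),
        lists:foreach(
            fun(Batch) -> lists:foreach(fun migrate/1, Batch) end,
            batches(Candidates, MaxSimultaneous)),
        lists:foreach(fun claim_ownership/1, Candidates).

    %% Split a list into chunks of at most N elements.
    batches(L, N) when length(L) =< N -> [L];
    batches(L, N) ->
        {Batch, Rest} = lists:split(N, L),
        [Batch | batches(Rest, N)].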
> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <[email protected]> wrote:
> > John,
> >
> > All great points. The problem is that the ring changes immediately when a
> > node is added. So now, all of a sudden, the preflist is potentially pointing
> > to nodes that don't have the data, and they won't have that data until
> > handoff occurs. The faster that data gets transferred, the less time your
> > clients have to hit 'notfound'.
> >
> > However, I agree completely with what you're saying. This is just a side
> > effect of how the system currently works. In a perfect world we wouldn't
> > care how long handoff takes, and we would also do some sort of automatic
> > congestion control akin to TCP Reno or something. The preflist would still
> > point to the "old" partitions until all data has been successfully handed
> > off, and then and only then would we flip the switch for that vnode. I'm
> > pretty sure that's where we are heading (I say "pretty sure" b/c I just
> > joined the team and haven't been heavily involved in these specific talks yet).
> >
> > It's all coming down the pipe...
> >
> > As for your specific I/O question re handoff_concurrency, you might be right.
> > I would think it depends on hardware/platform/etc. I was offering it as a
> > possible stopgap to minimize Greg's pain. It's certainly a cure to a
> > symptom, not the problem itself.
> >
> > -Ryan
> >
> > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <[email protected]> wrote:
> >> Hi Ryan, Greg,
> >>
> >> 2011/5/5 Ryan Zezeski <[email protected]>
> >>>
> >>> 1. For example, riak_core has a `handoff_concurrency` setting that
> >>> determines how many vnodes can concurrently handoff on a given node. By
> >>> default this is set to 4. That's going to take a while with your 2048
> >>> vnodes and all :)
> >>
> >> Won't that make the handoff situation potentially worse? From the thread I
> >> understood that the main problem was that the cluster was shuffling too much
> >> data around and thus becoming unresponsive and/or returning unexpected
> >> results (like "not founds"). I'm attributing the concerns more to an
> >> excessive I/O situation than to how long the handoff takes. If the handoff
> >> can be made transparent (no or little side effects), I don't think most
> >> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
> >>
> >> How about using a percentage of available I/O to throttle the vnode
> >> handoff concurrency? Start with 1 and monitor the node's I/O (kinda like
> >> 'atop' does, collecting CPU, disk, and network metrics); if it is below the
> >> expected usage, then increase the vnode handoff concurrency, and vice-versa.
> >>
> >> I for one would be perfectly happy if the handoff took several hours (even
> >> days) if we could keep the core riak_kv characteristics intact during
> >> those events. We've all seen looooong RAID rebuild times, and it's usually
> >> better to just sit tight and keep the rebuild speed low (slower I/O) while
> >> keeping all of the dependent systems running smoothly.
> >>
> >> cheers
> >> -jd
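For anyone wanting to try the stopgap Ryan mentions, the knob lives in app.config. A sketch only, since the exact file layout can differ between Riak releases:

    %% etc/app.config -- riak_core section. Default is 4; raising it
    %% finishes handoff sooner, lowering it reduces the amount of
    %% simultaneous handoff I/O on the node.
    {riak_core, [
        {handoff_concurrency, 4}
    ]}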
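And John's TCP-Reno-flavored throttle could look something like the following. A rough sketch: io_utilization/0 is an assumed sampler (fed by atop-style CPU/disk/network metrics) and does not exist in riak_core.

    -module(handoff_throttle).
    -export([tick/3]).

    %% Assumed sampler: fraction of I/O capacity in use, 0.0..1.0.
    %% A real version would read OS or riak_core stats.
    io_utilization() -> 0.5.

    %% AIMD, as in TCP Reno: additive increase while I/O is healthy,
    %% multiplicative decrease when the node is saturated.
    tick(Current, Min, Max) ->
        Util = io_utilization(),
        if
            Util > 0.85 -> max(Min, Current div 2);
            Util < 0.60 -> min(Max, Current + 1);
            true        -> Current
        end.

Run on a timer, the result would feed back into the effective handoff concurrency, starting from 1 as John suggests.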
