There are solutions to that consistency issue. You can set allow_mult to true, give each object a link to a change history, and have each change record what was modified. The change history could be kept as a singly linked list, where each change is inserted into a bucket under a randomly generated key. Then, on reading an object, if you find siblings you can go look at the change histories, merge them, and come up with a resolved object. This is a *lot* of application logic, but it should be doable.
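As a very rough sketch of what that application logic could look like (plain Python, with in-memory dicts standing in for the two buckets; the bucket layout, field names, and helpers are illustrative, not anything Riak or its clients actually provide):

    import uuid

    # In-memory stand-ins for two Riak buckets; with a real client these dict
    # operations would be get/store calls against a "changes" bucket and an
    # "objects" bucket that has allow_mult turned on. All names are illustrative.
    changes = {}   # change key -> {"prev": older change key or None, "delta": appended element}
    objects = {}   # object key -> list of sibling values, each {"head": newest change key}

    def record_change(prev_head, delta):
        """Insert a change under a randomly generated key, linked to the old head."""
        key = uuid.uuid4().hex
        changes[key] = {"prev": prev_head, "delta": delta}
        return key

    def walk_history(head):
        """Follow one sibling's singly linked change list, newest to oldest."""
        while head is not None:
            yield head, changes[head]["delta"]
            head = changes[head]["prev"]

    def resolve(siblings):
        """Merge siblings by taking the union of their change histories."""
        merged = {}                                   # change key -> delta; dedupes shared history
        for sibling in siblings:
            for change_key, delta in walk_history(sibling["head"]):
                merged.setdefault(change_key, delta)
        return list(merged.values())                  # resolved value; ordering is app-specific

    # Two writers append concurrently from the same head and leave two siblings.
    shared = record_change(None, "a")
    objects["some/array"] = [{"head": record_change(shared, "b")},
                             {"head": record_change(shared, "c")}]
    print(resolve(objects["some/array"]))             # all three appends survive the conflict

The point is that a sibling conflict only costs you a walk and merge of the histories; nothing written under a read/append/write pattern gets silently dropped.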
On Thu, May 5, 2011 at 1:14 PM, Greg Nelson <[email protected]> wrote:
> The future I'd like to see is basically what I initially expected. That is, I can add a single node to an online cluster and clients should not even see any effects of this or need to know that it's even happening -- except of course the side effects like the added load on the cluster incurred by gossiping new ring state, handing off data, etc. But if no data has actually been lost, I don't believe data should ever be unavailable, temporarily or not. And I'd like to be able to, as someone else mentioned, add a node and throttle the handoffs and let it trickle over hours or even days.
>
> Waving hands and saying that eventually the data will make it is true in principle, but in practice if you are following a read/modify/write pattern for some objects, you could easily lose data. E.g., my application writes JSON arrays to certain objects, and when it wishes to append something to the array, it will read/append/write back. If that initial read returns 404, then a new empty array is created. This is normal operation. But if that 404 is not a "normal" 404, it will happily create a new empty array, append, and write back a single-element array to that key. Of course there could have been a 100-element array in Riak that was just unavailable at the time, which is now effectively lost.
>
> Anyhow, I do understand the importance of knowing what will happen when doing something operationally like adding a node, and I understand that one can't naively expect everything to just work like magic. But the current behavior is pretty poorly documented and surprising. I don't think it was even mentioned in the operations webinar! (Ok, I'll stop beating a dead horse. :))
>
> On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:
>
> I'm really loving this thread. Generating great ideas for the way things should be... in the future. It seems to me that "the ring changes immediately" is actually the problem, as Ryan astutely mentions. One way the future could look is:
>
> - a new node comes online
> - introductions are made
> - candidate vnodes are selected for migration (<- insert pixie dust magic here)
> - the number of simultaneous migrations is configurable, fewer for limited interruption or more for quicker completion
> - vnodes are migrated
> - once migration is completed, ownership is claimed
>
> Selecting vnodes for migration is where the unicorn cavalry attack the dragon's den. If done right(er) the algorithm could be swappable to optimize for different strategies. Don't ask me how to implement it, I'm only a yellow belt in erlang-fu.
>
> Cheers,
> Alexander
>
> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <[email protected]> wrote:
>
> John,
> All great points. The problem is that the ring changes immediately when a node is added. So now, all of a sudden, the preflist is potentially pointing to nodes that don't have the data, and they won't have that data until handoff occurs. The faster that data gets transferred, the less time your clients have to hit 'notfound'.
> However, I agree completely with what you're saying. This is just a side effect of how the system currently works.
> In a perfect world we wouldn't care how long handoff takes, and we would also do some sort of automatic congestion control akin to TCP Reno or something. The preflist would still point to the "old" partitions until all data has been successfully handed off, and then and only then would we flip the switch for that vnode. I'm pretty sure that's where we are heading (I say "pretty sure" b/c I just joined the team and haven't been heavily involved in these specific talks yet).
> It's all coming down the pipe...
> As for your specific I/O question re handoff_concurrency, you might be right. I would think it depends on hardware/platform/etc. I was offering it as a possible stopgap to minimize Greg's pain. It's certainly a cure to a symptom, not the problem itself.
> -Ryan
>
> On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <[email protected]> wrote:
>
> Hi Ryan, Greg,
>
> 2011/5/5 Ryan Zezeski <[email protected]>
>
> 1. For example, riak_core has a `handoff_concurrency` setting that determines how many vnodes can concurrently handoff on a given node. By default this is set to 4. That's going to take a while with your 2048 vnodes and all :)
>
> Won't that make the handoff situation potentially worse? From the thread I understood that the main problem was that the cluster was shuffling too much data around and thus becoming unresponsive and/or returning unexpected results (like "not founds"). I'm attributing the concerns more to an excessive I/O situation than to how long the handoff takes. If the handoff can be made transparent (no or little side effects) I don't think most people will really care (e.g. the "fix the cluster tomorrow" anecdote).
>
> How about using a percentage of available I/O to throttle the vnode handoff concurrency? Start with 1, and monitor the node's I/O (kinda like 'atop' does, collecting CPU, disk and network metrics); if it is below the expected usage, then increase the vnode handoff concurrency, and vice versa.
>
> I for one would be perfectly happy if the handoff took several hours (even days) if we could maintain the core riak_kv characteristics intact during those events. We've all seen looooong RAID rebuild times, and it's usually better to just sit tight and keep the rebuild speed low (slower I/O) while keeping all of the dependent systems running smoothly.
>
> cheers
> -jd

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
