There's no concurrency for writes to these objects, which is what I was
hoping would simplify the problem. But it sounds like I'll have to turn
on allow_mult and resolve conflicts anyway.

On Thursday, May 5, 2011 at 2:23 PM, Bob Ippolito wrote:
> It's not necessarily as much application logic as you might think,
> you've just described what statebox [1] is an abstraction for (but it
> encapsulates change history in the value). It's all Erlang, but the
> technique could be applied in any language. That said, it's really
> frustrating that data is unavailable during hand-off, but at least you
> can mitigate it with a smart model (you should probably have this
> anyway). We're also really looking forward to having this issue
> resolved.
>
> Greg's usage pattern sounds like it's fundamentally inconsistent even
> in the normal case when no handoff is occurring (assuming that there's
> any concurrency for writes).
>
> [1] http://github.com/mochi/statebox
>
> On Thu, May 5, 2011 at 2:07 PM, Ben Tilly wrote:
> > There are solutions to that consistency issue. You can set
> > allow_mult to true, have each object have a link to a change
> > history, and have each change have a record of what changed. The
> > change history could be done as a singly linked list, where each
> > change is inserted into a bucket with a randomly generated key.
> >
> > And then on reading an object, if you find siblings, you can go look
> > at the change histories, merge them, and come up with a resolved
> > object.
> >
> > This is a *lot* of application logic, but it should be doable.
> >
> > On Thu, May 5, 2011 at 1:14 PM, Greg Nelson wrote:
> > > The future I'd like to see is basically what I initially expected.
> > > That is, I can add a single node to an online cluster and clients
> > > should not even see any effects of this or need to know that it's
> > > even happening -- except of course the side effects like the added
> > > load on the cluster incurred by gossiping new ring state, handing
> > > off data, etc. But if no data has actually been lost, I don't
> > > believe data should ever be unavailable, temporarily or not. And
> > > I'd like to be able to, as someone else mentioned, add a node and
> > > throttle the handoffs and let it trickle over hours or even days.
> > >
> > > Waving hands and saying that eventually the data will make it is
> > > true in principle, but in practice if you are following a
> > > read/modify/write pattern for some objects, you could easily lose
> > > data. e.g., my application writes JSON arrays to certain objects,
> > > and when it wishes to append something to the array, it will
> > > read/append/write back. If that initial read returns 404, then a
> > > new empty array is created. This is normal operation. But if that
> > > 404 is not a "normal" 404, it will happily create a new empty
> > > array, append, and write back a single-element array to that key.
> > > Of course there could have been a 100 element array in Riak that
> > > was just unavailable at the time which is now effectively lost.
> > >
> > > Anyhow, I do understand the importance of knowing what will happen
> > > when doing something operationally like adding a node, and I
> > > understand that one can't naively expect everything to just work
> > > like magic. But the current behavior is pretty poorly documented
> > > and surprising. I don't think it was even mentioned in the
> > > operations webinar! (Ok, I'll stop beating a dead horse. :))
> > >
> > > On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:
> > >
> > > I'm really loving this thread. Generating great ideas for the way
> > > things should be... in the future.
> > > It seems to me that "the ring changes immediately" is actually the
> > > problem as Ryan astutely mentions. One way the future could look
> > > is:
> > >
> > > - a new node comes online
> > > - introductions are made
> > > - candidate vnodes are selected for migration (<- insert pixie dust
> > >   magic here)
> > > - the number of simultaneous migrations is configurable, fewer for
> > >   limited interruption or more for quicker completion
> > > - vnodes are migrated
> > > - once migration is completed, ownership is claimed
> > >
> > > Selecting vnodes for migration is where the unicorn cavalry attack
> > > the dragon's den. If done right(er) the algorithm could be
> > > swappable to optimize for different strategies. Don't ask me how to
> > > implement it, I'm only a yellow belt in erlang-fu.
> > >
> > > Cheers,
> > > Alexander
> > >
> > > On Thu, May 5, 2011 at 13:33, Ryan Zezeski wrote:
> > >
> > > John,
> > > All great points. The problem is that the ring changes immediately
> > > when a node is added. So now, all of a sudden, the preflist is
> > > potentially pointing to nodes that don't have the data and they
> > > won't have that data until handoff occurs. The faster that data
> > > gets transferred, the less time your clients have to hit
> > > 'notfound'.
> > > However, I agree completely with what you're saying. This is just a
> > > side effect of how the system currently works. In a perfect world
> > > we wouldn't care how long handoff takes and we would also do some
> > > sort of automatic congestion control akin to TCP Reno or something.
> > > The preflist would still point to the "old" partitions until all
> > > data has been successfully handed off, and then and only then would
> > > we flip the switch for that vnode. I'm pretty sure that's where we
> > > are heading (I say "pretty sure" b/c I just joined the team and
> > > haven't been heavily involved in these specific talks yet).
> > > It's all coming down the pipe...
> > > As for your specific I/O question re handoff_concurrency, you might
> > > be right. I would think it depends on hardware/platform/etc. I was
> > > offering it as a possible stopgap to minimize Greg's pain. It's
> > > certainly a cure to a symptom, not the problem itself.
> > > -Ryan
> > >
> > > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell wrote:
> > >
> > > Hi Ryan, Greg,
> > >
> > > 2011/5/5 Ryan Zezeski
> > >
> > > 1. For example, riak_core has a `handoff_concurrency` setting that
> > > determines how many vnodes can concurrently handoff on a given
> > > node. By default this is set to 4. That's going to take a while
> > > with your 2048 vnodes and all :)
> > >
> > > Won't that make the handoff situation potentially worse? From the
> > > thread I understood that the main problem was that the cluster was
> > > shuffling too much data around and thus becoming unresponsive
> > > and/or returning unexpected results (like "not founds"). I'm
> > > attributing the concerns more to an excessive I/O situation than to
> > > how long the handoff takes. If the handoff can be made transparent
> > > (no or little side effects) I don't think most people will really
> > > care (e.g. the "fix the cluster tomorrow" anecdote).
> > >
> > > How about using a percentage of available I/O to throttle the vnode
> > > handoff concurrency? Start with 1, and monitor the node's I/O
> > > (kinda like 'atop' does, collecting CPU, disk and network metrics),
> > > if it is below the expected usage, then increase the vnode handoff
> > > concurrency, and vice-versa.
> > >
> > > I for one would be perfectly happy if the handoff took several
> > > hours (even days) if we could maintain the core riak_kv
> > > characteristics intact during those events.
> > > We've all seen looooong RAID rebuild times, and it's usually better
> > > to just sit tight and keep the rebuild speed low (slower I/O) while
> > > keeping all of the dependent systems running smoothly.
> > >
> > > cheers
> > > -jd
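
For the append-only JSON arrays described in this thread, resolving
siblings once allow_mult is on might look roughly like the sketch below.
It is only an illustration, in Python rather than Erlang, and it does not
use any Riak client API: the function name merge_append_only_siblings is
made up, the sibling values are assumed to arrive as a list of JSON
strings, and elements are assumed to be unique, so that equal elements
found in two siblings count as the same append. This is a plain ordered
union, not the change-history approach that statebox implements.

```python
import json


def merge_append_only_siblings(siblings):
    """Merge sibling values, each a JSON array, into a single list.

    Assumptions (not from the thread): arrays are append-only, and equal
    elements in different siblings represent the same append.
    """
    arrays = [json.loads(raw) for raw in siblings]
    merged = []
    seen = set()
    # Start from the longest sibling so the fullest history sets the order;
    # sorted() is stable, so ties keep their input order.
    for items in sorted(arrays, key=len, reverse=True):
        for item in items:
            # A canonical serialization lets dicts and lists be compared.
            key = json.dumps(item, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


# The failure case from the thread: a 100-element array was unavailable,
# so a single-element array was written as if the key did not exist.
full = json.dumps(list(range(100)))
orphan = json.dumps([100])
print(len(merge_append_only_siblings([full, orphan])))  # 101
```

With siblings kept, the single-element write no longer replaces the
100-element array; both survive until a read merges them. If the same
element can legitimately be appended twice, this union would collapse the
duplicates, and each append would need its own unique id.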
_______________________________________________
riak-users mailing list
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
