On Fri, Feb 18, 2011 at 2:33 PM, Adam Kocoloski <[email protected]> wrote:
> On Feb 18, 2011, at 11:16 AM, Paul Davis wrote:
>
>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
>>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
>>>> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> wrote:
>>>>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote:
>>>>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a
>>>>>> way to fast-forward replications (thanks Max for the prodding!). It's
>>>>>> non-trivial, but I think the benefit for big networks of CouchDB
>>>>>> servers can be substantial.
>>>>>>
>>>>>> The basic idea is that if A replicates with B, and B with C, then a new
>>>>>> replication between A and C should not need to start from scratch. I
>>>>>> think we can accomplish this as follows:
>>>>>>
>>>>>> 1) Store the target update sequence along with the source sequence in
>>>>>> the checkpoint document, at least in the checkpoint document on the
>>>>>> target. The following tuple is important: {Source, _local ID, Session
>>>>>> ID, SourceSeq, TargetSeq}. Using that syntax, let's say we have the
>>>>>> following replication records:
>>>>>>
>>>>>> On A
>>>>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target
>>>>>>                                     % sequence on the source
>>>>>>
>>>>>> On B
>>>>>> {A, _local/Foo, Bar, 5, 10}         % 5 on A corresponds to 10 on B
>>>>>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>>>>>
>>>>>> On C
>>>>>> {B, _local/Baz, Bif, 15, 7}         % 15 on B corresponds to 7 on C
>>>>>>
>>>>>> We know that A -> B happened before B -> C.
>>>>>>
>>>>>> 2) During the B -> C replication, when we reach source sequence number
>>>>>> 10, the _changes feed from B will deliver some extra information like
>>>>>>
>>>>>> {A, _local/Foo, Bar, 5}
>>>>>>
>>>>>> which will be stored at C. This may require a new disk-resident btree
>>>>>> keyed on update sequence, or at least an in-memory index constructed
>>>>>> by walking the _local docs btree.
>>>>>>
>>>>>> 3) When we trigger the A -> C replication, C will walk the full
>>>>>> checkpoint records in its _local tree and find no mention of A, but
>>>>>> then it will also consult the "transitive" checkpoints and find the
>>>>>> {A, _local/Foo, Bar, 5} record. It'll consult _local/Foo on A, find
>>>>>> that the session ID Bar is still present, and conclude that it can
>>>>>> fast-forward the replication and start from update sequence 5. It will
>>>>>> then remove that transitive checkpoint and replace it with a full
>>>>>> regular checkpoint.
>>>>>>
>>>>>> If server A crashes after the A -> B replication and restores from a
>>>>>> backup that was recorded before the replication, the session ID Bar
>>>>>> will be missing from _local/Foo, so when we try to do the A -> C
>>>>>> replication we won't fast-forward. This is the correct behavior.
>>>>>>
>>>>>> Hopefully this is comprehensible to someone other than me. We spent
>>>>>> some time trying to poke holes in it, but it's entirely possible there
>>>>>> are other things we didn't consider that will prevent it from working.
>>>>>> Cheers,
>>>>>>
>>>>>> Adam
>>>>>
>>>>> What Adam said. Also, I was just doing a brain dump and I think I
>>>>> might've punched a gaping hole into the whole scenario. I'm not
>>>>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>>>>> towards the end where my brain dump froze up. It's late, so maybe I'm
>>>>> just not seeing the easy way around it.
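>>>>>
>>>>> For reference, here's the step 3 fast-forward check as I understand
>>>>> it, as a rough Python sketch. Every name in here is invented for
>>>>> illustration; none of it is actual CouchDB API:
>>>>>
>>>>>     def find_start_seq(source, target):
>>>>>         """Decide where a source -> target replication may begin."""
>>>>>         # First preference: a full checkpoint left by a prior direct
>>>>>         # replication between this source and target.
>>>>>         ckpt = target.get_checkpoint_doc(source.uuid)
>>>>>         if ckpt is not None:
>>>>>             return ckpt["source_seq"]
>>>>>
>>>>>         # Otherwise look for a transitive checkpoint that some
>>>>>         # intermediate replication handed us, e.g. the
>>>>>         # {A, _local/Foo, Bar, 5} record stored on C.
>>>>>         for tc in target.transitive_checkpoints(source.uuid):
>>>>>             # Validate it: the _local doc on the source must still
>>>>>             # contain the recorded session ID. If the source was
>>>>>             # restored from an older backup the session is gone and
>>>>>             # we must not fast-forward.
>>>>>             history = source.get_local_doc(tc["local_id"])
>>>>>             if history and tc["session_id"] in history["session_ids"]:
>>>>>                 # Promote the transitive checkpoint to a regular one
>>>>>                 # and start from the recorded source sequence.
>>>>>                 target.replace_with_full_checkpoint(tc)
>>>>>                 return tc["source_seq"]
>>>>>
>>>>>         # No usable history; replicate from scratch.
>>>>>         return 0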
>>>>>
>>>>> There's also a picture of the end of our white board session at
>>>>> http://plixi.com/p/78268064 which probably means little to nothing
>>>>> without the context of having seen it drawn and the ensuing argument
>>>>> and wild gesticulations. But it's there for posterity.
>>>>>
>>>>> <brain_dump>
>>>>>
>>>>> Transitive Replication - The Idea
>>>>> =================================
>>>>>
>>>>> Consider the following scenario:
>>>>>
>>>>> 1. Replicate A -> B
>>>>> 2. Replicate B -> C
>>>>> 3. Replicate A -> C
>>>>>
>>>>> For simplicity's sake, assume no writes occur during this scenario.
>>>>> The question is: why can't we short-circuit step 3 to effectively be
>>>>> a no-op?
>>>>>
>>>>> Current Situation
>>>>> =================
>>>>>
>>>>> Replication state is based on a pair-wise state reflecting source and
>>>>> target information (and filter functions, etc). For the above scenario
>>>>> to be anywhere near plausible, a couple of things need to happen.
>>>>> First, we'll obviously need to transfer data from B -> C during
>>>>> replication so C has knowledge about A. This information will have to
>>>>> be complete enough to short-circuit (or skip part of) a replication
>>>>> from A.
>>>>>
>>>>> The information that B sends to C will need to enable a replication
>>>>> from A to C to occur without error in any sort of pathological state
>>>>> of A, regardless of what state C thinks A is in. Changes in state may
>>>>> include A "forgetting" some edits and resetting to a point in time
>>>>> before the state that C has (for instance, A crashed and was recovered
>>>>> to a previous point in time).
>>>>>
>>>>> C will also need to be able to uniquely identify A regardless of host
>>>>> or other transitory characteristics.
>>>>>
>>>>> An Old Proposition
>>>>> ==================
>>>>>
>>>>> There's been a proposal floated a few times, for a few different
>>>>> reasons, to give each database a UUID so that it is uniquely
>>>>> identifiable (ETags come to mind). Such a UUID, were it to exist,
>>>>> would allow us to uniquely identify a database in the above scenario.
>>>>>
>>>>> The first issue with db UUIDs that always pops up is what happens
>>>>> when someone copies a database (perhaps to short-circuit an initial
>>>>> replication, or to restore a db when a machine fails): the UUID may
>>>>> no longer be globally unique.
>>>>>
>>>>> This would need to be fixed for transitive replication to have any
>>>>> chance of working. One solution that was mentioned was to have each
>>>>> CouchDB node remember all the UUIDs it knows about, and if a db is
>>>>> opened with an unknown UUID, that db gets a new UUID assigned.
>>>>>
>>>>> This could be accomplished efficiently by storing _local docs in the
>>>>> replicator database that reference known UUID/dbname pairs. Then we
>>>>> just look up the UUID on db open, and if it doesn't match the one
>>>>> recorded for that db name, we reset it. (A sketch of this check is
>>>>> below.)
>>>>>
>>>>> For upgrade compatibility, and the ability to change UUIDs often, we
>>>>> could just store the UUID in the db header (as opposed to the first
>>>>> sixteen bytes of the file).
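>>>>>
>>>>> The reset-on-open check could look something like this, in hand-wavy
>>>>> Python (all the names are invented, not real CouchDB internals):
>>>>>
>>>>>     import uuid
>>>>>
>>>>>     # Stand-in for the _local docs in the replicator db that record
>>>>>     # known dbname -> UUID pairs.
>>>>>     known_uuids = {}
>>>>>
>>>>>     def on_db_open(dbname, header):
>>>>>         """Reset a db's UUID if it doesn't match the one recorded
>>>>>         for this db name."""
>>>>>         if known_uuids.get(dbname) != header["uuid"]:
>>>>>             # Either a db we've never seen, or a file copied from
>>>>>             # elsewhere. Assign a fresh UUID so the copy can't be
>>>>>             # mistaken for the original during replication.
>>>>>             header["uuid"] = uuid.uuid4().hex
>>>>>         known_uuids[dbname] = header["uuid"]
>>>>>         return header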
>>>>>
>>>>> Information Propagation Requirements
>>>>> ====================================
>>>>>
>>>>> When replication occurs we need to inform the target database of a
>>>>> few pieces of information so that it knows about the transitive
>>>>> replications it contains. We also need to make sure that the target
>>>>> db doesn't learn about this information before it contains the entire
>>>>> replica set, and it needs to be processed in such a way that it
>>>>> doesn't require complete replications.
>>>>>
>>>>> These requirements pretty much lead us to the fact that the replica
>>>>> state will need to be beamed across as the target receives information
>>>>> from the source update sequence. I.e., when we iterate the _changes
>>>>> feed, we get extra info once we've arrived at an update_seq that
>>>>> wholly contains some prior replication from an arbitrary node to the
>>>>> *source*.
>>>>>
>>>>> Information to Propagate
>>>>> ========================
>>>>>
>>>>> Now we need to consider what information needs to exist on a db in
>>>>> order to figure out if we *can* short-circuit a replication, as well
>>>>> as where we fast-forward *to*.
>>>>>
>>>>> One obvious piece of information is the UUID of the database stream.
>>>>> A second piece would be the update_seq for that UUID. After some
>>>>> thought we also realize we need to store some more information to
>>>>> check whether that UUID-update_seq pair is still valid when we go to
>>>>> fast-forward.
>>>>>
>>>>> The case that could invalidate a pair is a database crashing and
>>>>> needing to be restored. Consider if A replicates to B, and B to C. C
>>>>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>>>>> thought experiment. Now, at some point after C learns of A, A crashes
>>>>> and is restored from backup. Now A is at update_seq 5. We go on with
>>>>> our business and write 5 docs to A, but they are 5 *different* docs
>>>>> than we wrote before the restore. This divergence in history would
>>>>> not be detectable without extra information.
>>>>>
>>>>> After much hand waving about rolling hashes, Adam decided to remember
>>>>> that we store a replication history between two dbs. This can be
>>>>> represented as a _local doc id that includes information on the pair
>>>>> of dbs as well as a random session id. If we include this data with
>>>>> the UUID-update_seq pair, then when we check whether a short circuit
>>>>> is possible we can check that this record still exists.
>>>>>
>>>>> In the crash/restore case, even if we go and make the same edits and
>>>>> even end up with a similar replication history, the randomness of the
>>>>> session id will inform us that something has gone awry and we need to
>>>>> run a full replication to make sure we have all history.
>>>>>
>>>>> Information Required to Trigger Propagation
>>>>> ===========================================
>>>>>
>>>>> Along with the four pieces of information mentioned above, we also
>>>>> need to store which update_seq in the target database was the
>>>>> *result* of a replication. I.e., when we replicate A -> B, B needs to
>>>>> know the final update_seq of that replication transaction. This is so
>>>>> that when B replicates to C, it knows when to tell C about A. We
>>>>> can't do this at the very beginning because the replication might
>>>>> fail before all of the info from A is replicated. We also can't wait
>>>>> until the end because then C may never learn of A because of failure.
>>>>>
>>>>> This means that we need to know, for a given update_seq, whether once
>>>>> it has been replicated, C can suddenly fast-forward a replication
>>>>> with someone other than B. To do this, B will need to be able to
>>>>> stream its update sequence and efficiently check whether it completes
>>>>> some replication record that C should know about.
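>>>>>
>>>>> Leaving aside where those records live (that's the next problem), the
>>>>> shape of what B has to do while feeding a _changes stream is roughly
>>>>> this, in sketchy Python with invented names:
>>>>>
>>>>>     def stream_changes(db, since, emit):
>>>>>         """Walk the seq index, emitting transitive checkpoints once
>>>>>         the stream has wholly covered them."""
>>>>>         # Replication records completed at some target seq on this
>>>>>         # db, e.g. {A-UUID, _local/Foo, Bar, 5} completed at seq 10,
>>>>>         # kept in an index ordered by that seq.
>>>>>         pending = db.replication_records_after(since)
>>>>>
>>>>>         for change in db.changes_since(since):
>>>>>             emit(change)
>>>>>             # Once we pass seq 10, the consumer provably holds
>>>>>             # everything the A -> B replication wrote, so it's safe
>>>>>             # to tell it about A.
>>>>>             while pending and pending[0].target_seq <= change.seq:
>>>>>                 emit(pending.pop(0).as_transitive_checkpoint())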
>>>>>
>>>>> We might quickly assume that storing this in the existing update_seq
>>>>> b+tree would be kosher, but it isn't. Consider the case where
>>>>> update_seq 6 on B is the end of the replication A -> B. Now consider
>>>>> that B starts replicating to C while someone starts updating the doc
>>>>> at update_seq 6 on B. It's possible that a series of events could
>>>>> lead to C never learning of A, because the update_seq for the doc id
>>>>> from 6 keeps jumping to the latest update_seq.
>>>>>
>>>>> The proper way to fix this would be to insert code that says "when an
>>>>> update_seq entry is updated, move its replication info to the next
>>>>> update seq", which sounds like it could get really quite wonky.
>>>>>
>>>>> So the solution would be to have some sort of indexed structure of
>>>>> replication records that can be scanned to know when to send out some
>>>>> sort of "replication finished" notice...
>>>>>
>>>>> Ruh Roh
>>>>> =======
>>>>>
>>>>> I just realized something wonky with this whole plan. We *don't*
>>>>> necessarily know when a replication ends in terms of update
>>>>> sequences. For instance, if we replicate A -> B, then edit a doc from
>>>>> A on B, and then replicate B -> C, can we ever know when to
>>>>> short-circuit a replication?
>>>>>
>>>>> This could be a huge gaping hole. Someone prove me wrong.
>>>>>
>>>>> Storing Replication State
>>>>> =========================
>>>>>
>>>>> With this new piece of information we'll also require some way to
>>>>> store replication state. This should hopefully be hand-wavy trivial:
>>>>> just store replication records in _local docs, very similarly to how
>>>>> they're currently stored.
>>>>>
>>>>> </brain_dump>
>>>>>
>>>>
>>>> The important point of my ruh roh, which I failed to articulate: the
>>>> reason this is bad is that when we edit the doc on B before
>>>> replicating to C, C *can't* know what's on A until it gets to the new
>>>> version of the doc in B. This, coupled with the fact that we can edit
>>>> anything on B and that those edits all jump to the end of the seq
>>>> tree, makes me think that we'd have to do some more extensive
>>>> bookkeeping to make sure that C doesn't know about B until after all
>>>> of A's docs get pushed.
>>>>
>>>> Blargghhh....
>>>>
>>>
>>> Doesn't know about A until all of A's docs get pushed. It's late. I'm
>>> out.
>>>
>>
>> After sleeping on it, I think that this doesn't shoot the whole idea
>> out of the sky, but requires us to only send the info when a
>> replication manages to reach the end of the update_seq btree in a
>> single db snapshot. I'm not sure whether that means it'd be out of the
>> question for continuous replication or not.
>
> Hi Paul, thanks for this articulate writeup. I think you're correct in
> this last email: we can only send these extra bits of information about
> other replications whenever we've reached the end of an MVCC snapshot
> from the current source. That shouldn't be a problem for continuous
> replication, since under the hood it's implemented as a loop of "open /
> walk seq_tree / wait for new updates" calls. We can just send any new
> transitive checkpoints that we encountered during the current walk just
> before going into the "wait for new updates" step.
>
> Adam

Heh, articulate. Sounds good for continuous replication. I wasn't certain
whether it just held on to an open changes feed or not, but if not, then
it sounds like it'll all be clean enough.
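
For concreteness, here's roughly the loop I'm picturing, with the
transitive checkpoint flush pinned to the end of each snapshot walk.
Hand-wavy Python again, and none of these names are real CouchDB
functions:

    def continuous_replicate(source, target):
        """One MVCC snapshot walk per iteration; transitive checkpoints
        only go out after a walk reaches the end of its snapshot."""
        seq = target.checkpointed_seq(source.uuid)
        while True:
            # A fixed, consistent view of the source's seq tree.
            snapshot = source.open_snapshot()
            transitive = []
            for change in snapshot.changes_since(seq):
                target.write(change)
                transitive.extend(snapshot.completed_replications(change.seq))
                seq = change.seq
            # The walk covered the whole snapshot, so the target now
            # holds everything those earlier replications wrote; only
            # now is it safe to hand the transitive checkpoints over.
            target.store_transitive_checkpoints(transitive)
            source.wait_for_new_updates()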
