On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote: > On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> > wrote: >> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> >> wrote: >>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote: >>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way >>>> to fast-forward replications (thanks Max for the prodding!). It's >>>> non-trivial, but I think the benefit for big networks of CouchDB servers >>>> can be substantial. >>>> >>>> The basic idea is that if A replicates with B, and B with C, then a new >>>> replication between A and C should not need to start from scratch. I >>>> think we can accomplish this as follows: >>>> >>>> 1) Store the target update sequence along with the source sequence in the >>>> checkpoint document, at least in the checkpoint document on the target. >>>> The following tuple is important: {Source, _local ID, Session ID, >>>> SourceSeq, TargetSeq}. Using that syntax let's say we have the following >>>> replication records: >>>> >>>> On A >>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on >>>> the source >>>> >>>> On B >>>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B >>>> {B, _local/Baz, Bif, 15, _TargetSeq} >>>> >>>> On C >>>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C >>>> >>>> We know that A -> B happened before B -> C. >>>> >>>> 2) During the B -> C replication, when we reach source sequence number 10, >>>> the _changes feed from B will deliver some extra information like >>>> >>>> {A, _local/Foo, Bar, 5} >>>> >>>> which will be stored at C. This may require a new disk-resident btree >>>> keyed on update sequence, or at least an in-memory index constructed by >>>> walking the _local docs btree. >>>> >>>> 3) When we trigger the A -> C replication, C will walk the full checkpoint >>>> records in its _local tree and find no mention of A, but then it will also >>>> consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} >>>> record. It'll consult _local/Foo on A, find that the session ID Bar is >>>> still present, and conclude that it can fast-forward the replication and >>>> start from update sequence 5. It will then remove that transitive >>>> checkpoint and replace it with a full regular checkpoint. >>>> >>>> If server A crashes after the A -> B replication and restores from a >>>> backup that was recorded before the replication, the session ID Bar will >>>> be missing from _local/Foo, so when we try to do the A -> replication we >>>> won't fast forward. This is the correct behavior. >>>> >>>> Hopefully this is comprehensible to someone other than me. We spent some >>>> time trying to poke holes in it, but it's entirely possible there are >>>> other things we didn't consider that will prevent it from working. Cheers, >>>> >>>> Adam >>> >>> What Adam said. Also, I was just doing a brain dump and I think I >>> might've punched a gaping whole into the whole scenario. I'm not >>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh" >>> towards the end where my brain dump froze up. Its late so maybe I'm >>> just not seeing the easy way around it. >>> >>> There's also a picture of the end of our white board session at >>> http://plixi.com/p/78268064 which probably means little to nothing >>> without the context of having seen it drawn and the ensuing argument >>> and wild gesticulations. But its there for posterity. >>> >>> <brain_dump> >>> >>> Transitive Replication - The Idea >>> ================================= >>> >>> Consider the following scenario: >>> >>> 1. Replicate A -> B >>> 2. Replicate B -> C >>> 3. Replicate A -> C >>> >>> For simplicity's sake, assume no writes occur during this scenario. The >>> question is why can't we short circuit step 3 to effectively be a no-op? >>> >>> Current Situation >>> ================= >>> >>> Replication state is based on a pair-wise state reflecting source and >>> target information (and filter functions etc). For the above scenario to >>> be anywhere near plausible a couple things need to happen. First, we'll >>> obviously need to transfer data from B -> C during replication so it >>> has knowledge about A. This information will have to be complete enough >>> to short circuit (or skip part of) a replication from A. >>> >>> The information that B sends to C will need to enable a replication from >>> A to C to occur without error in any sort of pathological state of A >>> irregardless of what state C thinks A is in. Changes in state may include >>> A "forgetting" some edits and resetting to a point in time the state >>> that C has (for instance, A crashed and was recovered to a previous >>> point in time). >>> >>> C will also need to be able to uniquely identify A regardless of host or >>> other transitory characteristics. >>> >>> An Old Proposition >>> ================== >>> >>> There's been a proposal floated a few times for a few different reasons >>> to give each database a UUID so that it is uniquely identifiable for >>> various reasons (ETags come to mind). Such a UUID were it to exist would >>> allow us to uniquely identify a database in the above scenario. >>> >>> The first issue with db UUID's that always pops up is that we have to >>> address the case of what happens when someone copies a database (perhaps >>> to short circuit an initial replication, or restoring a db when a >>> machine fails) is that the UUID may no longer be globally unique. >>> >>> This would need to be fixed for transitive replication to have any >>> chance of working. One solution that was mentioned was to have each >>> CouchDB node remember all UUID's that it knows about and if a db is >>> opened with an unknown UUID, that db gets a new UUID assigned. >>> >>> This could be accomplished efficiently by storing _local docs in the >>> replicator database that reference known UUID/dbname pairs. Then we >>> just lookup the UUID on db open and if it doesn't match the db name >>> we reset it. >>> >>> For upgrade compatibility and the ability to change UUID's often we >>> could just store the UUID in the db header (as opposed to the first >>> sixteen bytes of the file). >>> >>> Information Propagation Requirements >>> ==================================== >>> >>> When replication occurs we need to inform the target database of a >>> few pieces of information so that it knows about transitive replications >>> that it contains. We also need to make sure that the target db doesn't >>> learn about this information before it contains the entire replica set >>> and it needs to be processed in such a way that it doesn't require >>> complete replications. >>> >>> These requirements pretty much lead us to the fact that the replica >>> state will need to be beamed across as the target receives information >>> from the source update sequence. Ie, when we iterate the _changes feed >>> we get extra info when we've arrived an update_seq that wholly contains >>> some prior replication from an arbitrary node to the *source*. >>> >>> Information to Propagate >>> ======================== >>> >>> Now we need to consider what information needs to exist on a db in >>> order to figure out if we *can* short circuit a replication as well as >>> where we fast forward *to*. >>> >>> One obvious piece of information is the UUID of the database stream. A >>> second piece would be the update_seq for that UUID. After some thought >>> we also realize we need to store some more information to check if that >>> UUID-update_seq pair is still valid when we go to fast-forward. >>> >>> The case that could invalidate a pair is if a database crashes and it >>> needs to be restored. Consider if A replicates to B replicates to C. C >>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this >>> thought experiment. Now at some point after C learns of A, A crashes and >>> is restored from backup. Now A is at update_seq 5. Now we go on with >>> our business and write 5 docs to A. But we also write 5 *different* docs >>> than we wrote before the restore. This divergence in history would not >>> be detectable without extra information. >>> >>> After much hand waving about rolling hashes, Adam decided to remember >>> that we store a replication history between two db's. This can be >>> represented as a _local doc id that includes information on the pair >>> of db's as well as a random session id. If we include this data with >>> the UUID-update_seq pair, when we check if a short circuit is possible >>> we can check that this record still exists. >>> >>> In the case of the crash/restore even if we go and make the same edits >>> and even have a similar replication history, the randomness to the >>> session id will inform us that something has gone awry and we need to >>> run a full replication to make sure we have all history. >>> >>> >>> Information Required to Trigger Propagation >>> =========================================== >>> >>> Along with the four pieces of information mentioned above, we also need >>> to store what update_seq in the target database was the *result* of a >>> replication. Ie, when we replicate A -> B, B needs to know the final >>> update_seq of that replication transaction. This is so that when B >>> replicates to C, it knows when to tell C about A. We can't do this at the >>> very beginning because the replication might fail before all of the >>> info from A is replicated. We also can't wait until the end because then >>> C may never learn of A because of failure. >>> >>> This means that we need to know for a given update_seq if after it has >>> been replicated, C can suddenly fast-forward a replication with someone >>> other than B. To do this B will need to be able to stream its update >>> sequence and efficiently check if that completes some replication record >>> that C should know about. >>> >>> We might quickly assume that storing this in the existing update seq >>> b+tree would be kosher, but it isn't. Consider the case where update_seq >>> 6 on B is the end of the replication A -> B. Now consider that B starts >>> replicating to C while someone starts updating the doc for update_seq >>> 6 on B. Its possible that a series of events could lead to C never >>> learning of A because the update_seq for the doc id from 6 keeps jumping >>> to the latest update_seq. >>> >>> The proper way to fix this would be to insert code that says "when an >>> update_seq entry is updated, move its replication info to the next update >>> seq" which sounds like it could get really quite wonky. >>> >>> So the solution would be to have some sort of indexed structure of >>> replication records that can be scanned to know when to send out some >>> replication finished.... >>> >>> Ruh Roh >>> ======= >>> >>> I just realized something wonky with this whole plan. We *don't* >>> necessarily know when a replication ends because of update sequences. For >>> instance, if we replicate A -> B, and then edit a doc from A on B, and then >>> replicate B -> C, can we ever know when to short circuit a replication? >>> >>> This could be a huge gaping whole. Someone prove me wrong. >>> >>> Storing Replication State >>> ========================= >>> >>> With this new piece of information we'll also require some way to store >>> replication state. This should hopefully be hand-wavy trivial by just >>> storing replication records in _local docs very similarly to how they're >>> currently stored. >>> >>> </brain_dump> >>> >> >> The important point of my ruh roh to realize that I failed to >> articulate, the reason that this is bad is that if when we edit the >> doc on B before replication to C, C *can't* know what's on A until it >> gets to the new version of the doc in B. This coupled with the fact >> that we can edit anything on B, and that they all jump to the end >> makes me think that we'd have to do some more extensive bookkeeping to >> make sure that C doesn't know about B until after all of A's docs get >> pushed. >> >> Blargghhh.... >> > > Doesn't know about A until all of A's docs get pushed. Its late. I'm out. >
After sleeping on it, I think that this doesn't shoot the whole idea out of the sky, but requires us to only send the info when a replication manages to reach the end of the update_seq btree in a single db snapshot. I'm not sure if that means that it'd be out of the question for continuous replication or not.
