On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote: > On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> > wrote: >> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote: >>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way >>> to fast-forward replications (thanks Max for the prodding!). It's >>> non-trivial, but I think the benefit for big networks of CouchDB servers >>> can be substantial. >>> >>> The basic idea is that if A replicates with B, and B with C, then a new >>> replication between A and C should not need to start from scratch. I think >>> we can accomplish this as follows: >>> >>> 1) Store the target update sequence along with the source sequence in the >>> checkpoint document, at least in the checkpoint document on the target. >>> The following tuple is important: {Source, _local ID, Session ID, >>> SourceSeq, TargetSeq}. Using that syntax let's say we have the following >>> replication records: >>> >>> On A >>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on >>> the source >>> >>> On B >>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B >>> {B, _local/Baz, Bif, 15, _TargetSeq} >>> >>> On C >>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C >>> >>> We know that A -> B happened before B -> C. >>> >>> 2) During the B -> C replication, when we reach source sequence number 10, >>> the _changes feed from B will deliver some extra information like >>> >>> {A, _local/Foo, Bar, 5} >>> >>> which will be stored at C. This may require a new disk-resident btree keyed >>> on update sequence, or at least an in-memory index constructed by walking >>> the _local docs btree. >>> >>> 3) When we trigger the A -> C replication, C will walk the full checkpoint >>> records in its _local tree and find no mention of A, but then it will also >>> consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} >>> record. It'll consult _local/Foo on A, find that the session ID Bar is >>> still present, and conclude that it can fast-forward the replication and >>> start from update sequence 5. It will then remove that transitive >>> checkpoint and replace it with a full regular checkpoint. >>> >>> If server A crashes after the A -> B replication and restores from a backup >>> that was recorded before the replication, the session ID Bar will be >>> missing from _local/Foo, so when we try to do the A -> replication we won't >>> fast forward. This is the correct behavior. >>> >>> Hopefully this is comprehensible to someone other than me. We spent some >>> time trying to poke holes in it, but it's entirely possible there are other >>> things we didn't consider that will prevent it from working. Cheers, >>> >>> Adam >> >> What Adam said. Also, I was just doing a brain dump and I think I >> might've punched a gaping whole into the whole scenario. I'm not >> entirely certain yet, but it seems ungood. There's a section "Ruh Roh" >> towards the end where my brain dump froze up. Its late so maybe I'm >> just not seeing the easy way around it. >> >> There's also a picture of the end of our white board session at >> http://plixi.com/p/78268064 which probably means little to nothing >> without the context of having seen it drawn and the ensuing argument >> and wild gesticulations. But its there for posterity. >> >> <brain_dump> >> >> Transitive Replication - The Idea >> ================================= >> >> Consider the following scenario: >> >> 1. Replicate A -> B >> 2. Replicate B -> C >> 3. Replicate A -> C >> >> For simplicity's sake, assume no writes occur during this scenario. The >> question is why can't we short circuit step 3 to effectively be a no-op? >> >> Current Situation >> ================= >> >> Replication state is based on a pair-wise state reflecting source and >> target information (and filter functions etc). For the above scenario to >> be anywhere near plausible a couple things need to happen. First, we'll >> obviously need to transfer data from B -> C during replication so it >> has knowledge about A. This information will have to be complete enough >> to short circuit (or skip part of) a replication from A. >> >> The information that B sends to C will need to enable a replication from >> A to C to occur without error in any sort of pathological state of A >> irregardless of what state C thinks A is in. Changes in state may include >> A "forgetting" some edits and resetting to a point in time the state >> that C has (for instance, A crashed and was recovered to a previous >> point in time). >> >> C will also need to be able to uniquely identify A regardless of host or >> other transitory characteristics. >> >> An Old Proposition >> ================== >> >> There's been a proposal floated a few times for a few different reasons >> to give each database a UUID so that it is uniquely identifiable for >> various reasons (ETags come to mind). Such a UUID were it to exist would >> allow us to uniquely identify a database in the above scenario. >> >> The first issue with db UUID's that always pops up is that we have to >> address the case of what happens when someone copies a database (perhaps >> to short circuit an initial replication, or restoring a db when a >> machine fails) is that the UUID may no longer be globally unique. >> >> This would need to be fixed for transitive replication to have any >> chance of working. One solution that was mentioned was to have each >> CouchDB node remember all UUID's that it knows about and if a db is >> opened with an unknown UUID, that db gets a new UUID assigned. >> >> This could be accomplished efficiently by storing _local docs in the >> replicator database that reference known UUID/dbname pairs. Then we >> just lookup the UUID on db open and if it doesn't match the db name >> we reset it. >> >> For upgrade compatibility and the ability to change UUID's often we >> could just store the UUID in the db header (as opposed to the first >> sixteen bytes of the file). >> >> Information Propagation Requirements >> ==================================== >> >> When replication occurs we need to inform the target database of a >> few pieces of information so that it knows about transitive replications >> that it contains. We also need to make sure that the target db doesn't >> learn about this information before it contains the entire replica set >> and it needs to be processed in such a way that it doesn't require >> complete replications. >> >> These requirements pretty much lead us to the fact that the replica >> state will need to be beamed across as the target receives information >> from the source update sequence. Ie, when we iterate the _changes feed >> we get extra info when we've arrived an update_seq that wholly contains >> some prior replication from an arbitrary node to the *source*. >> >> Information to Propagate >> ======================== >> >> Now we need to consider what information needs to exist on a db in >> order to figure out if we *can* short circuit a replication as well as >> where we fast forward *to*. >> >> One obvious piece of information is the UUID of the database stream. A >> second piece would be the update_seq for that UUID. After some thought >> we also realize we need to store some more information to check if that >> UUID-update_seq pair is still valid when we go to fast-forward. >> >> The case that could invalidate a pair is if a database crashes and it >> needs to be restored. Consider if A replicates to B replicates to C. C >> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this >> thought experiment. Now at some point after C learns of A, A crashes and >> is restored from backup. Now A is at update_seq 5. Now we go on with >> our business and write 5 docs to A. But we also write 5 *different* docs >> than we wrote before the restore. This divergence in history would not >> be detectable without extra information. >> >> After much hand waving about rolling hashes, Adam decided to remember >> that we store a replication history between two db's. This can be >> represented as a _local doc id that includes information on the pair >> of db's as well as a random session id. If we include this data with >> the UUID-update_seq pair, when we check if a short circuit is possible >> we can check that this record still exists. >> >> In the case of the crash/restore even if we go and make the same edits >> and even have a similar replication history, the randomness to the >> session id will inform us that something has gone awry and we need to >> run a full replication to make sure we have all history. >> >> >> Information Required to Trigger Propagation >> =========================================== >> >> Along with the four pieces of information mentioned above, we also need >> to store what update_seq in the target database was the *result* of a >> replication. Ie, when we replicate A -> B, B needs to know the final >> update_seq of that replication transaction. This is so that when B >> replicates to C, it knows when to tell C about A. We can't do this at the >> very beginning because the replication might fail before all of the >> info from A is replicated. We also can't wait until the end because then >> C may never learn of A because of failure. >> >> This means that we need to know for a given update_seq if after it has >> been replicated, C can suddenly fast-forward a replication with someone >> other than B. To do this B will need to be able to stream its update >> sequence and efficiently check if that completes some replication record >> that C should know about. >> >> We might quickly assume that storing this in the existing update seq >> b+tree would be kosher, but it isn't. Consider the case where update_seq >> 6 on B is the end of the replication A -> B. Now consider that B starts >> replicating to C while someone starts updating the doc for update_seq >> 6 on B. Its possible that a series of events could lead to C never >> learning of A because the update_seq for the doc id from 6 keeps jumping >> to the latest update_seq. >> >> The proper way to fix this would be to insert code that says "when an >> update_seq entry is updated, move its replication info to the next update >> seq" which sounds like it could get really quite wonky. >> >> So the solution would be to have some sort of indexed structure of >> replication records that can be scanned to know when to send out some >> replication finished.... >> >> Ruh Roh >> ======= >> >> I just realized something wonky with this whole plan. We *don't* >> necessarily know when a replication ends because of update sequences. For >> instance, if we replicate A -> B, and then edit a doc from A on B, and then >> replicate B -> C, can we ever know when to short circuit a replication? >> >> This could be a huge gaping whole. Someone prove me wrong. >> >> Storing Replication State >> ========================= >> >> With this new piece of information we'll also require some way to store >> replication state. This should hopefully be hand-wavy trivial by just >> storing replication records in _local docs very similarly to how they're >> currently stored. >> >> </brain_dump> >> > > The important point of my ruh roh to realize that I failed to > articulate, the reason that this is bad is that if when we edit the > doc on B before replication to C, C *can't* know what's on A until it > gets to the new version of the doc in B. This coupled with the fact > that we can edit anything on B, and that they all jump to the end > makes me think that we'd have to do some more extensive bookkeeping to > make sure that C doesn't know about B until after all of A's docs get > pushed. > > Blargghhh.... >
Doesn't know about A until all of A's docs get pushed. Its late. I'm out.
