Re: idea for transitive replication checkpoints

Paul Davis Thu, 17 Feb 2011 22:25:35 -0800

On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> 
> wrote:
>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote:
>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way 
>>> to fast-forward replications (thanks Max for the prodding!).  It's 
>>> non-trivial, but I think the benefit for big networks of CouchDB servers 
>>> can be substantial.
>>>
>>> The basic idea is that if A replicates with B, and B with C, then a new 
>>> replication between A and C should not need to start from scratch.  I think 
>>> we can accomplish this as follows:
>>>
>>> 1) Store the target update sequence along with the source sequence in the 
>>> checkpoint document, at least in the checkpoint document on the target.  
>>> The following tuple is important: {Source, _local ID, Session ID, 
>>> SourceSeq, TargetSeq}.  Using that syntax let's say we have the following 
>>> replication records:
>>>
>>> On A
>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on 
>>> the source
>>>
>>> On B
>>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
>>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>>
>>> On C
>>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
>>>
>>> We know that A -> B happened before B -> C.
>>>
>>> 2) During the B -> C replication, when we reach source sequence number 10, 
>>> the _changes feed from B will deliver some extra information like
>>>
>>> {A, _local/Foo, Bar, 5}
>>>
>>> which will be stored at C. This may require a new disk-resident btree keyed 
>>> on update sequence, or at least an in-memory index constructed by walking 
>>> the _local docs btree.
>>>
>>> 3) When we trigger the A -> C replication, C will walk the full checkpoint 
>>> records in its _local tree and find no mention of A, but then it will also 
>>> consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} 
>>> record.  It'll consult _local/Foo on A, find that the session ID Bar is 
>>> still present, and conclude that it can fast-forward the replication and 
>>> start from update sequence 5.  It will then remove that transitive 
>>> checkpoint and replace it with a full regular checkpoint.
>>>
>>> If server A crashes after the A -> B replication and restores from a backup 
>>> that was recorded before the replication, the session ID Bar will be 
>>> missing from _local/Foo, so when we try to do the A -> replication we won't 
>>> fast forward.  This is the correct behavior.
>>>
>>> Hopefully this is comprehensible to someone other than me.  We spent some 
>>> time trying to poke holes in it, but it's entirely possible there are other 
>>> things we didn't consider that will prevent it from working.  Cheers,
>>>
>>> Adam
>>
>> What Adam said. Also, I was just doing a brain dump and I think I
>> might've punched a gaping whole into the whole scenario. I'm not
>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>> towards the end where my brain dump froze up. Its late so maybe I'm
>> just not seeing the easy way around it.
>>
>> There's also a picture of the end of our white board session at
>> http://plixi.com/p/78268064 which probably means little to nothing
>> without the context of having seen it drawn and the ensuing argument
>> and wild gesticulations. But its there for posterity.
>>
>> <brain_dump>
>>
>> Transitive Replication - The Idea
>> =================================
>>
>> Consider the following scenario:
>>
>> 1. Replicate A -> B
>> 2. Replicate B -> C
>> 3. Replicate A -> C
>>
>> For simplicity's sake, assume no writes occur during this scenario. The
>> question is why can't we short circuit step 3 to effectively be a no-op?
>>
>> Current Situation
>> =================
>>
>> Replication state is based on a pair-wise state reflecting source and
>> target information (and filter functions etc). For the above scenario to
>> be anywhere near plausible a couple things need to happen. First, we'll
>> obviously need to transfer data from B -> C during replication so it
>> has knowledge about A. This information will have to be complete enough
>> to short circuit (or skip part of) a replication from A.
>>
>> The information that B sends to C will need to enable a replication from
>> A to C to occur without error in any sort of pathological state of A
>> irregardless of what state C thinks A is in. Changes in state may include
>> A "forgetting" some edits and resetting to a point in time the state
>> that C has (for instance, A crashed and was recovered to a previous
>> point in time).
>>
>> C will also need to be able to uniquely identify A regardless of host or
>> other transitory characteristics.
>>
>> An Old Proposition
>> ==================
>>
>> There's been a proposal floated a few times for a few different reasons
>> to give each database a UUID so that it is uniquely identifiable for
>> various reasons (ETags come to mind). Such a UUID were it to exist would
>> allow us to uniquely identify a database in the above scenario.
>>
>> The first issue with db UUID's that always pops up is that we have to
>> address the case of what happens when someone copies a database (perhaps
>> to short circuit an initial replication, or restoring a db when a
>> machine fails) is that the UUID may no longer be globally unique.
>>
>> This would need to be fixed for transitive replication to have any
>> chance of working. One solution that was mentioned was to have each
>> CouchDB node remember all UUID's that it knows about and if a db is
>> opened with an unknown UUID, that db gets a new UUID assigned.
>>
>> This could be accomplished efficiently by storing _local docs in the
>> replicator database that reference known UUID/dbname pairs. Then we
>> just lookup the UUID on db open and if it doesn't match the db name
>> we reset it.
>>
>> For upgrade compatibility and the ability to change UUID's often we
>> could just store the UUID in the db header (as opposed to the first
>> sixteen bytes of the file).
>>
>> Information Propagation Requirements
>> ====================================
>>
>> When replication occurs we need to inform the target database of a
>> few pieces of information so that it knows about transitive replications
>> that it contains. We also need to make sure that the target db doesn't
>> learn about this information before it contains the entire replica set
>> and it needs to be processed in such a way that it doesn't require
>> complete replications.
>>
>> These requirements pretty much lead us to the fact that the replica
>> state will need to be beamed across as the target receives information
>> from the source update sequence. Ie, when we iterate the _changes feed
>> we get extra info when we've arrived an update_seq that wholly contains
>> some prior replication from an arbitrary node to the *source*.
>>
>> Information to Propagate
>> ========================
>>
>> Now we need to consider what information needs to exist on a db in
>> order to figure out if we *can* short circuit a replication as well as
>> where we fast forward *to*.
>>
>> One obvious piece of information is the UUID of the database stream. A
>> second piece would be the update_seq for that UUID. After some thought
>> we also realize we need to store some more information to check if that
>> UUID-update_seq pair is still valid when we go to fast-forward.
>>
>> The case that could invalidate a pair is if a database crashes and it
>> needs to be restored. Consider if A replicates to B replicates to C. C
>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>> thought experiment. Now at some point after C learns of A, A crashes and
>> is restored from backup. Now A is at update_seq 5. Now we go on with
>> our business and write 5 docs to A. But we also write 5 *different* docs
>> than we wrote before the restore. This divergence in history would not
>> be detectable without extra information.
>>
>> After much hand waving about rolling hashes, Adam decided to remember
>> that we store a replication history between two db's. This can be
>> represented as a _local doc id that includes information on the pair
>> of db's as well as a random session id. If we include this data with
>> the UUID-update_seq pair, when we check if a short circuit is possible
>> we can check that this record still exists.
>>
>> In the case of the crash/restore even if we go and make the same edits
>> and even have a similar replication history, the randomness to the
>> session id will inform us that something has gone awry and we need to
>> run a full replication to make sure we have all history.
>>
>>
>> Information Required to Trigger Propagation
>> ===========================================
>>
>> Along with the four pieces of information mentioned above, we also need
>> to store what update_seq in the target database was the *result* of a
>> replication. Ie, when we replicate A -> B, B needs to know the final
>> update_seq of that replication transaction. This is so that when B
>> replicates to C, it knows when to tell C about A. We can't do this at the
>> very beginning because the replication might fail before all of the
>> info from A is replicated. We also can't wait until the end because then
>> C may never learn of A because of failure.
>>
>> This means that we need to know for a given update_seq if after it has
>> been replicated, C can suddenly fast-forward a replication with someone
>> other than B. To do this B will need to be able to stream its update
>> sequence and efficiently check if that completes some replication record
>> that C should know about.
>>
>> We might quickly assume that storing this in the existing update seq
>> b+tree would be kosher, but it isn't. Consider the case where update_seq
>> 6 on B is the end of the replication A -> B. Now consider that B starts
>> replicating to C while someone starts updating the doc for update_seq
>> 6 on B. Its possible that a series of events could lead to C never
>> learning of A because the update_seq for the doc id from 6 keeps jumping
>> to the latest update_seq.
>>
>> The proper way to fix this would be to insert code that says "when an
>> update_seq entry is updated, move its replication info to the next update
>> seq" which sounds like it could get really quite wonky.
>>
>> So the solution would be to have some sort of indexed structure of
>> replication records that can be scanned to know when to send out some
>> replication finished....
>>
>> Ruh Roh
>> =======
>>
>> I just realized something wonky with this whole plan. We *don't*
>> necessarily know when a replication ends because of update sequences. For
>> instance, if we replicate A -> B, and then edit a doc from A on B, and then
>> replicate B -> C, can we ever know when to short circuit a replication?
>>
>> This could be a huge gaping whole. Someone prove me wrong.
>>
>> Storing Replication State
>> =========================
>>
>> With this new piece of information we'll also require some way to store
>> replication state. This should hopefully be hand-wavy trivial by just
>> storing replication records in _local docs very similarly to how they're
>> currently stored.
>>
>> </brain_dump>
>>
>
> The important point of my ruh roh to realize that I failed to
> articulate, the reason that this is bad is that if when we edit the
> doc on B before replication to C, C *can't* know what's on A until it
> gets to the new version of the doc in B. This coupled with the fact
> that we can edit anything on B, and that they all jump to the end
> makes me think that we'd have to do some more extensive bookkeeping to
> make sure that C doesn't know about B until after all of A's docs get
> pushed.
>
> Blargghhh....
>


Doesn't know about A until all of A's docs get pushed. Its late. I'm out.

Re: idea for transitive replication checkpoints

Reply via email to