Re: idea for transitive replication checkpoints

Paul Davis Fri, 18 Feb 2011 08:17:56 -0800

On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> 
> wrote:
>> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> 
>> wrote:
>>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote:
>>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way 
>>>> to fast-forward replications (thanks Max for the prodding!).  It's 
>>>> non-trivial, but I think the benefit for big networks of CouchDB servers 
>>>> can be substantial.
>>>>
>>>> The basic idea is that if A replicates with B, and B with C, then a new 
>>>> replication between A and C should not need to start from scratch.  I 
>>>> think we can accomplish this as follows:
>>>>
>>>> 1) Store the target update sequence along with the source sequence in the 
>>>> checkpoint document, at least in the checkpoint document on the target.  
>>>> The following tuple is important: {Source, _local ID, Session ID, 
>>>> SourceSeq, TargetSeq}.  Using that syntax let's say we have the following 
>>>> replication records:
>>>>
>>>> On A
>>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on 
>>>> the source
>>>>
>>>> On B
>>>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
>>>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>>>
>>>> On C
>>>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
>>>>
>>>> We know that A -> B happened before B -> C.
>>>>
>>>> 2) During the B -> C replication, when we reach source sequence number 10, 
>>>> the _changes feed from B will deliver some extra information like
>>>>
>>>> {A, _local/Foo, Bar, 5}
>>>>
>>>> which will be stored at C. This may require a new disk-resident btree 
>>>> keyed on update sequence, or at least an in-memory index constructed by 
>>>> walking the _local docs btree.
>>>>
>>>> 3) When we trigger the A -> C replication, C will walk the full checkpoint 
>>>> records in its _local tree and find no mention of A, but then it will also 
>>>> consult the "transitive" checkpoints and find the {A, _local/Foo, Bar, 5} 
>>>> record.  It'll consult _local/Foo on A, find that the session ID Bar is 
>>>> still present, and conclude that it can fast-forward the replication and 
>>>> start from update sequence 5.  It will then remove that transitive 
>>>> checkpoint and replace it with a full regular checkpoint.
>>>>
>>>> If server A crashes after the A -> B replication and restores from a 
>>>> backup that was recorded before the replication, the session ID Bar will 
>>>> be missing from _local/Foo, so when we try to do the A -> replication we 
>>>> won't fast forward.  This is the correct behavior.
>>>>
>>>> Hopefully this is comprehensible to someone other than me.  We spent some 
>>>> time trying to poke holes in it, but it's entirely possible there are 
>>>> other things we didn't consider that will prevent it from working.  Cheers,
>>>>
>>>> Adam
>>>
>>> What Adam said. Also, I was just doing a brain dump and I think I
>>> might've punched a gaping whole into the whole scenario. I'm not
>>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>>> towards the end where my brain dump froze up. Its late so maybe I'm
>>> just not seeing the easy way around it.
>>>
>>> There's also a picture of the end of our white board session at
>>> http://plixi.com/p/78268064 which probably means little to nothing
>>> without the context of having seen it drawn and the ensuing argument
>>> and wild gesticulations. But its there for posterity.
>>>
>>> <brain_dump>
>>>
>>> Transitive Replication - The Idea
>>> =================================
>>>
>>> Consider the following scenario:
>>>
>>> 1. Replicate A -> B
>>> 2. Replicate B -> C
>>> 3. Replicate A -> C
>>>
>>> For simplicity's sake, assume no writes occur during this scenario. The
>>> question is why can't we short circuit step 3 to effectively be a no-op?
>>>
>>> Current Situation
>>> =================
>>>
>>> Replication state is based on a pair-wise state reflecting source and
>>> target information (and filter functions etc). For the above scenario to
>>> be anywhere near plausible a couple things need to happen. First, we'll
>>> obviously need to transfer data from B -> C during replication so it
>>> has knowledge about A. This information will have to be complete enough
>>> to short circuit (or skip part of) a replication from A.
>>>
>>> The information that B sends to C will need to enable a replication from
>>> A to C to occur without error in any sort of pathological state of A
>>> irregardless of what state C thinks A is in. Changes in state may include
>>> A "forgetting" some edits and resetting to a point in time the state
>>> that C has (for instance, A crashed and was recovered to a previous
>>> point in time).
>>>
>>> C will also need to be able to uniquely identify A regardless of host or
>>> other transitory characteristics.
>>>
>>> An Old Proposition
>>> ==================
>>>
>>> There's been a proposal floated a few times for a few different reasons
>>> to give each database a UUID so that it is uniquely identifiable for
>>> various reasons (ETags come to mind). Such a UUID were it to exist would
>>> allow us to uniquely identify a database in the above scenario.
>>>
>>> The first issue with db UUID's that always pops up is that we have to
>>> address the case of what happens when someone copies a database (perhaps
>>> to short circuit an initial replication, or restoring a db when a
>>> machine fails) is that the UUID may no longer be globally unique.
>>>
>>> This would need to be fixed for transitive replication to have any
>>> chance of working. One solution that was mentioned was to have each
>>> CouchDB node remember all UUID's that it knows about and if a db is
>>> opened with an unknown UUID, that db gets a new UUID assigned.
>>>
>>> This could be accomplished efficiently by storing _local docs in the
>>> replicator database that reference known UUID/dbname pairs. Then we
>>> just lookup the UUID on db open and if it doesn't match the db name
>>> we reset it.
>>>
>>> For upgrade compatibility and the ability to change UUID's often we
>>> could just store the UUID in the db header (as opposed to the first
>>> sixteen bytes of the file).
>>>
>>> Information Propagation Requirements
>>> ====================================
>>>
>>> When replication occurs we need to inform the target database of a
>>> few pieces of information so that it knows about transitive replications
>>> that it contains. We also need to make sure that the target db doesn't
>>> learn about this information before it contains the entire replica set
>>> and it needs to be processed in such a way that it doesn't require
>>> complete replications.
>>>
>>> These requirements pretty much lead us to the fact that the replica
>>> state will need to be beamed across as the target receives information
>>> from the source update sequence. Ie, when we iterate the _changes feed
>>> we get extra info when we've arrived an update_seq that wholly contains
>>> some prior replication from an arbitrary node to the *source*.
>>>
>>> Information to Propagate
>>> ========================
>>>
>>> Now we need to consider what information needs to exist on a db in
>>> order to figure out if we *can* short circuit a replication as well as
>>> where we fast forward *to*.
>>>
>>> One obvious piece of information is the UUID of the database stream. A
>>> second piece would be the update_seq for that UUID. After some thought
>>> we also realize we need to store some more information to check if that
>>> UUID-update_seq pair is still valid when we go to fast-forward.
>>>
>>> The case that could invalidate a pair is if a database crashes and it
>>> needs to be restored. Consider if A replicates to B replicates to C. C
>>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>>> thought experiment. Now at some point after C learns of A, A crashes and
>>> is restored from backup. Now A is at update_seq 5. Now we go on with
>>> our business and write 5 docs to A. But we also write 5 *different* docs
>>> than we wrote before the restore. This divergence in history would not
>>> be detectable without extra information.
>>>
>>> After much hand waving about rolling hashes, Adam decided to remember
>>> that we store a replication history between two db's. This can be
>>> represented as a _local doc id that includes information on the pair
>>> of db's as well as a random session id. If we include this data with
>>> the UUID-update_seq pair, when we check if a short circuit is possible
>>> we can check that this record still exists.
>>>
>>> In the case of the crash/restore even if we go and make the same edits
>>> and even have a similar replication history, the randomness to the
>>> session id will inform us that something has gone awry and we need to
>>> run a full replication to make sure we have all history.
>>>
>>>
>>> Information Required to Trigger Propagation
>>> ===========================================
>>>
>>> Along with the four pieces of information mentioned above, we also need
>>> to store what update_seq in the target database was the *result* of a
>>> replication. Ie, when we replicate A -> B, B needs to know the final
>>> update_seq of that replication transaction. This is so that when B
>>> replicates to C, it knows when to tell C about A. We can't do this at the
>>> very beginning because the replication might fail before all of the
>>> info from A is replicated. We also can't wait until the end because then
>>> C may never learn of A because of failure.
>>>
>>> This means that we need to know for a given update_seq if after it has
>>> been replicated, C can suddenly fast-forward a replication with someone
>>> other than B. To do this B will need to be able to stream its update
>>> sequence and efficiently check if that completes some replication record
>>> that C should know about.
>>>
>>> We might quickly assume that storing this in the existing update seq
>>> b+tree would be kosher, but it isn't. Consider the case where update_seq
>>> 6 on B is the end of the replication A -> B. Now consider that B starts
>>> replicating to C while someone starts updating the doc for update_seq
>>> 6 on B. Its possible that a series of events could lead to C never
>>> learning of A because the update_seq for the doc id from 6 keeps jumping
>>> to the latest update_seq.
>>>
>>> The proper way to fix this would be to insert code that says "when an
>>> update_seq entry is updated, move its replication info to the next update
>>> seq" which sounds like it could get really quite wonky.
>>>
>>> So the solution would be to have some sort of indexed structure of
>>> replication records that can be scanned to know when to send out some
>>> replication finished....
>>>
>>> Ruh Roh
>>> =======
>>>
>>> I just realized something wonky with this whole plan. We *don't*
>>> necessarily know when a replication ends because of update sequences. For
>>> instance, if we replicate A -> B, and then edit a doc from A on B, and then
>>> replicate B -> C, can we ever know when to short circuit a replication?
>>>
>>> This could be a huge gaping whole. Someone prove me wrong.
>>>
>>> Storing Replication State
>>> =========================
>>>
>>> With this new piece of information we'll also require some way to store
>>> replication state. This should hopefully be hand-wavy trivial by just
>>> storing replication records in _local docs very similarly to how they're
>>> currently stored.
>>>
>>> </brain_dump>
>>>
>>
>> The important point of my ruh roh to realize that I failed to
>> articulate, the reason that this is bad is that if when we edit the
>> doc on B before replication to C, C *can't* know what's on A until it
>> gets to the new version of the doc in B. This coupled with the fact
>> that we can edit anything on B, and that they all jump to the end
>> makes me think that we'd have to do some more extensive bookkeeping to
>> make sure that C doesn't know about B until after all of A's docs get
>> pushed.
>>
>> Blargghhh....
>>
>
> Doesn't know about A until all of A's docs get pushed. Its late. I'm out.
>


After sleeping on it, I think that this doesn't shoot the whole idea
out of the sky, but requires us to only send the info when a
replication manages to reach the end of the update_seq btree in a
single db snapshot. I'm not sure if that means that it'd be out of the
question for continuous replication or not.

Re: idea for transitive replication checkpoints

Reply via email to