On Fri, Feb 18, 2011 at 2:33 PM, Adam Kocoloski <[email protected]> wrote:
> On Feb 18, 2011, at 11:16 AM, Paul Davis wrote:
>
>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
>>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <[email protected]> wrote:
>>>> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <[email protected]> wrote:
>>>>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <[email protected]> wrote:
>>>>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a
>>>>>> way to fast-forward replications (thanks Max for the prodding!). It's
>>>>>> non-trivial, but I think the benefit for big networks of CouchDB
>>>>>> servers can be substantial.
>>>>>>
>>>>>> The basic idea is that if A replicates with B, and B with C, then a new
>>>>>> replication between A and C should not need to start from scratch. I
>>>>>> think we can accomplish this as follows:
>>>>>>
>>>>>> 1) Store the target update sequence along with the source sequence in
>>>>>> the checkpoint document, at least in the checkpoint document on the
>>>>>> target. The following tuple is important: {Source, _local ID, Session
>>>>>> ID, SourceSeq, TargetSeq}. Using that syntax, let's say we have the
>>>>>> following replication records:
>>>>>>
>>>>>> On A
>>>>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target
>>>>>>                                     % sequence on the source
>>>>>>
>>>>>> On B
>>>>>> {A, _local/Foo, Bar, 5, 10}         % 5 on A corresponds to 10 on B
>>>>>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>>>>>
>>>>>> On C
>>>>>> {B, _local/Baz, Bif, 15, 7}         % 15 on B corresponds to 7 on C
>>>>>>
>>>>>> We know that A -> B happened before B -> C.
>>>>>>
>>>>>> 2) During the B -> C replication, when we reach source sequence number
>>>>>> 10, the _changes feed from B will deliver some extra information like
>>>>>>
>>>>>> {A, _local/Foo, Bar, 5}
>>>>>>
>>>>>> which will be stored at C. This may require a new disk-resident btree
>>>>>> keyed on update sequence, or at least an in-memory index constructed
>>>>>> by walking the _local docs btree.
>>>>>>
>>>>>> 3) When we trigger the A -> C replication, C will walk the full
>>>>>> checkpoint records in its _local tree and find no mention of A, but
>>>>>> then it will also consult the "transitive" checkpoints and find the
>>>>>> {A, _local/Foo, Bar, 5} record. It'll consult _local/Foo on A, find
>>>>>> that the session ID Bar is still present, and conclude that it can
>>>>>> fast-forward the replication and start from update sequence 5. It will
>>>>>> then remove that transitive checkpoint and replace it with a full
>>>>>> regular checkpoint.
>>>>>>
>>>>>> If server A crashes after the A -> B replication and restores from a
>>>>>> backup that was recorded before the replication, the session ID Bar
>>>>>> will be missing from _local/Foo, so when we try to do the A -> C
>>>>>> replication we won't fast-forward. This is the correct behavior.
>>>>>>
>>>>>> Hopefully this is comprehensible to someone other than me. We spent
>>>>>> some time trying to poke holes in it, but it's entirely possible there
>>>>>> are other things we didn't consider that will prevent it from working.
>>>>>> Cheers,
>>>>>>
>>>>>> Adam
>>>>>
>>>>> What Adam said. Also, I was just doing a brain dump and I think I
>>>>> might've punched a gaping hole into the whole scenario. I'm not
>>>>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>>>>> towards the end where my brain dump froze up. It's late, so maybe I'm
>>>>> just not seeing the easy way around it.
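>>>>>
>>>>> For reference, here's the step 3 fast-forward check as I understand
>>>>> it, as a rough Python sketch. Every name in here is invented for
>>>>> illustration; none of it is actual CouchDB API:
>>>>>
>>>>>     def find_start_seq(source, target):
>>>>>         """Decide where a source -> target replication may begin."""
>>>>>         # First preference: a full checkpoint left by a prior direct
>>>>>         # replication between this source and target.
>>>>>         ckpt = target.get_checkpoint_doc(source.uuid)
>>>>>         if ckpt is not None:
>>>>>             return ckpt["source_seq"]
>>>>>
>>>>>         # Otherwise look for a transitive checkpoint that some
>>>>>         # intermediate replication handed us, e.g. the
>>>>>         # {A, _local/Foo, Bar, 5} record stored on C.
>>>>>         for tc in target.transitive_checkpoints(source.uuid):
>>>>>             # Validate it: the _local doc on the source must still
>>>>>             # contain the recorded session ID. If the source was
>>>>>             # restored from an older backup the session is gone and
>>>>>             # we must not fast-forward.
>>>>>             history = source.get_local_doc(tc["local_id"])
>>>>>             if history and tc["session_id"] in history["session_ids"]:
>>>>>                 # Promote the transitive checkpoint to a regular one
>>>>>                 # and start from the recorded source sequence.
>>>>>                 target.replace_with_full_checkpoint(tc)
>>>>>                 return tc["source_seq"]
>>>>>
>>>>>         # No usable history; replicate from scratch.
>>>>>         return 0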
>>>>>
>>>>> There's also a picture of the end of our white board session at
>>>>> http://plixi.com/p/78268064 which probably means little to nothing
>>>>> without the context of having seen it drawn and the ensuing argument
>>>>> and wild gesticulations. But it's there for posterity.
>>>>>
>>>>> <brain_dump>
>>>>>
>>>>> Transitive Replication - The Idea
>>>>> =================================
>>>>>
>>>>> Consider the following scenario:
>>>>>
>>>>> 1. Replicate A -> B
>>>>> 2. Replicate B -> C
>>>>> 3. Replicate A -> C
>>>>>
>>>>> For simplicity's sake, assume no writes occur during this scenario.
>>>>> The question is: why can't we short-circuit step 3 to effectively be
>>>>> a no-op?
>>>>>
>>>>> Current Situation
>>>>> =================
>>>>>
>>>>> Replication state is based on a pair-wise state reflecting source and
>>>>> target information (and filter functions, etc). For the above scenario
>>>>> to be anywhere near plausible, a couple of things need to happen.
>>>>> First, we'll obviously need to transfer data from B -> C during
>>>>> replication so C has knowledge about A. This information will have to
>>>>> be complete enough to short-circuit (or skip part of) a replication
>>>>> from A.
>>>>>
>>>>> The information that B sends to C will need to enable a replication
>>>>> from A to C to occur without error in any sort of pathological state
>>>>> of A, regardless of what state C thinks A is in. Changes in state may
>>>>> include A "forgetting" some edits and resetting to a point in time
>>>>> before the state that C has (for instance, A crashed and was recovered
>>>>> to a previous point in time).
>>>>>
>>>>> C will also need to be able to uniquely identify A regardless of host
>>>>> or other transitory characteristics.
>>>>>
>>>>> An Old Proposition
>>>>> ==================
>>>>>
>>>>> There's been a proposal floated a few times, for a few different
>>>>> reasons, to give each database a UUID so that it is uniquely
>>>>> identifiable (ETags come to mind). Such a UUID, were it to exist,
>>>>> would allow us to uniquely identify a database in the above scenario.
>>>>>
>>>>> The first issue with db UUIDs that always pops up is what happens
>>>>> when someone copies a database (perhaps to short-circuit an initial
>>>>> replication, or to restore a db when a machine fails): the UUID may
>>>>> no longer be globally unique.
>>>>>
>>>>> This would need to be fixed for transitive replication to have any
>>>>> chance of working. One solution that was mentioned was to have each
>>>>> CouchDB node remember all the UUIDs it knows about, and if a db is
>>>>> opened with an unknown UUID, that db gets a new UUID assigned.
>>>>>
>>>>> This could be accomplished efficiently by storing _local docs in the
>>>>> replicator database that reference known UUID/dbname pairs. Then we
>>>>> just look up the UUID on db open, and if it doesn't match the one
>>>>> recorded for that db name, we reset it. (A sketch of this check is
>>>>> below.)
>>>>>
>>>>> For upgrade compatibility, and the ability to change UUIDs often, we
>>>>> could just store the UUID in the db header (as opposed to the first
>>>>> sixteen bytes of the file).
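>>>>>
>>>>> The reset-on-open check could look something like this, in hand-wavy
>>>>> Python (all the names are invented, not real CouchDB internals):
>>>>>
>>>>>     import uuid
>>>>>
>>>>>     # Stand-in for the _local docs in the replicator db that record
>>>>>     # known dbname -> UUID pairs.
>>>>>     known_uuids = {}
>>>>>
>>>>>     def on_db_open(dbname, header):
>>>>>         """Reset a db's UUID if it doesn't match the one recorded
>>>>>         for this db name."""
>>>>>         if known_uuids.get(dbname) != header["uuid"]:
>>>>>             # Either a db we've never seen, or a file copied from
>>>>>             # elsewhere. Assign a fresh UUID so the copy can't be
>>>>>             # mistaken for the original during replication.
>>>>>             header["uuid"] = uuid.uuid4().hex
>>>>>         known_uuids[dbname] = header["uuid"]
>>>>>         return header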
>>>>>
>>>>> Information Propagation Requirements
>>>>> ====================================
>>>>>
>>>>> When replication occurs we need to inform the target database of a
>>>>> few pieces of information so that it knows about the transitive
>>>>> replications it contains. We also need to make sure that the target
>>>>> db doesn't learn about this information before it contains the entire
>>>>> replica set, and it needs to be processed in such a way that it
>>>>> doesn't require complete replications.
>>>>>
>>>>> These requirements pretty much lead us to the fact that the replica
>>>>> state will need to be beamed across as the target receives information
>>>>> from the source update sequence. I.e., when we iterate the _changes
>>>>> feed, we get extra info once we've arrived at an update_seq that
>>>>> wholly contains some prior replication from an arbitrary node to the
>>>>> *source*.
>>>>>
>>>>> Information to Propagate
>>>>> ========================
>>>>>
>>>>> Now we need to consider what information needs to exist on a db in
>>>>> order to figure out if we *can* short-circuit a replication, as well
>>>>> as where we fast-forward *to*.
>>>>>
>>>>> One obvious piece of information is the UUID of the database stream.
>>>>> A second piece would be the update_seq for that UUID. After some
>>>>> thought we also realize we need to store some more information to
>>>>> check whether that UUID-update_seq pair is still valid when we go to
>>>>> fast-forward.
>>>>>
>>>>> The case that could invalidate a pair is a database crashing and
>>>>> needing to be restored. Consider if A replicates to B, and B to C. C
>>>>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>>>>> thought experiment. Now, at some point after C learns of A, A crashes
>>>>> and is restored from backup. Now A is at update_seq 5. We go on with
>>>>> our business and write 5 docs to A, but they are 5 *different* docs
>>>>> than we wrote before the restore. This divergence in history would
>>>>> not be detectable without extra information.
>>>>>
>>>>> After much hand waving about rolling hashes, Adam decided to remember
>>>>> that we store a replication history between two dbs. This can be
>>>>> represented as a _local doc id that includes information on the pair
>>>>> of dbs as well as a random session id. If we include this data with
>>>>> the UUID-update_seq pair, then when we check whether a short circuit
>>>>> is possible we can check that this record still exists.
>>>>>
>>>>> In the crash/restore case, even if we go and make the same edits and
>>>>> even end up with a similar replication history, the randomness of the
>>>>> session id will inform us that something has gone awry and we need to
>>>>> run a full replication to make sure we have all history.
>>>>>
>>>>> Information Required to Trigger Propagation
>>>>> ===========================================
>>>>>
>>>>> Along with the four pieces of information mentioned above, we also
>>>>> need to store which update_seq in the target database was the
>>>>> *result* of a replication. I.e., when we replicate A -> B, B needs to
>>>>> know the final update_seq of that replication transaction. This is so
>>>>> that when B replicates to C, it knows when to tell C about A. We
>>>>> can't do this at the very beginning because the replication might
>>>>> fail before all of the info from A is replicated. We also can't wait
>>>>> until the end because then C may never learn of A because of failure.
>>>>>
>>>>> This means that we need to know, for a given update_seq, whether once
>>>>> it has been replicated, C can suddenly fast-forward a replication
>>>>> with someone other than B. To do this, B will need to be able to
>>>>> stream its update sequence and efficiently check whether it completes
>>>>> some replication record that C should know about.
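>>>>>
>>>>> Leaving aside where those records live (that's the next problem), the
>>>>> shape of what B has to do while feeding a _changes stream is roughly
>>>>> this, in sketchy Python with invented names:
>>>>>
>>>>>     def stream_changes(db, since, emit):
>>>>>         """Walk the seq index, emitting transitive checkpoints once
>>>>>         the stream has wholly covered them."""
>>>>>         # Replication records completed at some target seq on this
>>>>>         # db, e.g. {A-UUID, _local/Foo, Bar, 5} completed at seq 10,
>>>>>         # kept in an index ordered by that seq.
>>>>>         pending = db.replication_records_after(since)
>>>>>
>>>>>         for change in db.changes_since(since):
>>>>>             emit(change)
>>>>>             # Once we pass seq 10, the consumer provably holds
>>>>>             # everything the A -> B replication wrote, so it's safe
>>>>>             # to tell it about A.
>>>>>             while pending and pending[0].target_seq <= change.seq:
>>>>>                 emit(pending.pop(0).as_transitive_checkpoint())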
>>>>>
>>>>> We might quickly assume that storing this in the existing update_seq
>>>>> b+tree would be kosher, but it isn't. Consider the case where
>>>>> update_seq 6 on B is the end of the replication A -> B. Now consider
>>>>> that B starts replicating to C while someone starts updating the doc
>>>>> at update_seq 6 on B. It's possible that a series of events could
>>>>> lead to C never learning of A, because the update_seq for the doc id
>>>>> from 6 keeps jumping to the latest update_seq.
>>>>>
>>>>> The proper way to fix this would be to insert code that says "when an
>>>>> update_seq entry is updated, move its replication info to the next
>>>>> update seq", which sounds like it could get really quite wonky.
>>>>>
>>>>> So the solution would be to have some sort of indexed structure of
>>>>> replication records that can be scanned to know when to send out some
>>>>> sort of "replication finished" notice...
>>>>>
>>>>> Ruh Roh
>>>>> =======
>>>>>
>>>>> I just realized something wonky with this whole plan. We *don't*
>>>>> necessarily know when a replication ends in terms of update
>>>>> sequences. For instance, if we replicate A -> B, then edit a doc from
>>>>> A on B, and then replicate B -> C, can we ever know when to
>>>>> short-circuit a replication?
>>>>>
>>>>> This could be a huge gaping hole. Someone prove me wrong.
>>>>>
>>>>> Storing Replication State
>>>>> =========================
>>>>>
>>>>> With this new piece of information we'll also require some way to
>>>>> store replication state. This should hopefully be hand-wavy trivial:
>>>>> just store replication records in _local docs, very similarly to how
>>>>> they're currently stored.
>>>>>
>>>>> </brain_dump>
>>>>>
>>>>
>>>> The important point of my ruh roh, which I failed to articulate: the
>>>> reason this is bad is that when we edit the doc on B before
>>>> replicating to C, C *can't* know what's on A until it gets to the new
>>>> version of the doc in B. This, coupled with the fact that we can edit
>>>> anything on B and that those edits all jump to the end of the seq
>>>> tree, makes me think that we'd have to do some more extensive
>>>> bookkeeping to make sure that C doesn't know about B until after all
>>>> of A's docs get pushed.
>>>>
>>>> Blargghhh....
>>>>
>>>
>>> Doesn't know about A until all of A's docs get pushed. It's late. I'm
>>> out.
>>>
>>
>> After sleeping on it, I think that this doesn't shoot the whole idea
>> out of the sky, but requires us to only send the info when a
>> replication manages to reach the end of the update_seq btree in a
>> single db snapshot. I'm not sure whether that means it'd be out of the
>> question for continuous replication or not.
>
> Hi Paul, thanks for this articulate writeup. I think you're correct in
> this last email: we can only send these extra bits of information about
> other replications whenever we've reached the end of an MVCC snapshot
> from the current source. That shouldn't be a problem for continuous
> replication, since under the hood it's implemented as a loop of "open /
> walk seq_tree / wait for new updates" calls. We can just send any new
> transitive checkpoints that we encountered during the current walk just
> before going into the "wait for new updates" step.
>
> Adam

Heh, articulate. Sounds good for continuous replication. I wasn't certain
whether it just held on to an open changes feed or not, but if not, then
it sounds like it'll all be clean enough.
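
For concreteness, here's roughly the loop I'm picturing, with the
transitive checkpoint flush pinned to the end of each snapshot walk.
Hand-wavy Python again, and none of these names are real CouchDB
functions:

    def continuous_replicate(source, target):
        """One MVCC snapshot walk per iteration; transitive checkpoints
        only go out after a walk reaches the end of its snapshot."""
        seq = target.checkpointed_seq(source.uuid)
        while True:
            # A fixed, consistent view of the source's seq tree.
            snapshot = source.open_snapshot()
            transitive = []
            for change in snapshot.changes_since(seq):
                target.write(change)
                transitive.extend(snapshot.completed_replications(change.seq))
                seq = change.seq
            # The walk covered the whole snapshot, so the target now
            # holds everything those earlier replications wrote; only
            # now is it safe to hand the transitive checkpoints over.
            target.store_transitive_checkpoints(transitive)
            source.wait_for_new_updates()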
