(I suppose this is a [DISCUSS] - but isn’t that really the default for a 
*mailing list*? :)

I’ve been having a few discussions with folks about how we might evolve to a 
more efficient replication protocol, and wanted to share some thoughts here. 
Apologies in advance, this one is kinda long.

As a reminder, the current replication protocol goes something like this:

1. Read the _changes feed with the style=all_docs option to get a list of all 
leaf revisions for each document
2. Ask the target DB which of those revisions are missing using the _revs_diff 
endpoint (this is a batch operation across multiple docs)
3. If any document has missing revisions on the target, retrieve all of those 
revisions (and their full history) in a multipart request per doc
4. Save the missing revisions on the target. Documents with attachments are 
saved individually using the multipart API, otherwise _bulk_docs is used
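
For concreteness, here is a rough sketch of those four steps as plain HTTP 
calls (Python with the requests library; checkpoints, batching, error handling, 
and the multipart attachment path are all elided):

```python
import requests

def replicate_once(source, target, since=0):
    """A lossy sketch of one replication pass; not the real replicator."""
    # 1. Read the _changes feed with style=all_docs to get every leaf rev per doc
    changes = requests.get(f"{source}/_changes",
                           params={"style": "all_docs", "since": since}).json()
    revs_by_doc = {row["id"]: [c["rev"] for c in row["changes"]]
                   for row in changes["results"]}

    # 2. Ask the target which of those revisions it is missing (one batch call)
    diff = requests.post(f"{target}/_revs_diff", json=revs_by_doc).json()

    # 3. Fetch each missing revision, with its full history, from the source
    #    (the real protocol uses one multipart request per doc)
    docs = []
    for doc_id, info in diff.items():
        for rev in info.get("missing", []):
            docs.append(requests.get(f"{source}/{doc_id}",
                                     params={"rev": rev, "revs": "true"}).json())

    # 4. Save the missing revisions on the target without minting new rev ids
    if docs:
        requests.post(f"{target}/_bulk_docs",
                      json={"docs": docs, "new_edits": False})
```
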

I would posit that most replication jobs fall into one of two scenarios:

Scenario A: First-time replication to a new database where we need to transfer 
*everything* — all revisions, and full paths for each
Scenario B: Incremental replication, where we likely only need to transfer the 
contents of the specific revision recorded at each sequence in the feed, and a 
partial revision path is usually enough to merge that revision correctly into 
the target document.

Our current protocol doesn’t quite handle either of those scenarios optimally, 
partly because it also wants to be efficient for the case where two peers have 
some overlapping set of edits that they’ve acquired by independent means and 
they want to avoid unnecessarily transferring document bodies. In practice I 
don’t think that happens very often. Here are some ideas for improvements aimed 
at making those two scenarios above more efficient. The summarized list I have 
in mind includes three “safe” optimizations:

- Implement a shard-level _changes interface
- Identify the document revision(s) actually associated with each sequence in 
the feed, and include their bodies and revision histories
- Provide the output of _revs_diff in the response to a new_edits=false 
operation

and a somewhat more exotic attempt at saving on bandwidth for replication of 
frequently-edited documents:

- Include a configurable amount of revision history
- Have the target DB report whether it was able to extend existing revision 
paths when saving with new_edits=false

I’ll go into more detail on each of those below.

## Implement a shard-level _changes interface

This one is applicable to both scenarios. The source cluster spends a lot of 
CPU cycles merging the feeds from individual shards and (especially) computing 
a cluster-wide sequence. That sequence obscures the actual shard that 
contributed the row in question; it doesn’t sort very cleanly, uses a boatload 
of bandwidth, and cannot easily be decoded by consumers. In general, yuck.

Some applications will continue to benefit from having a single URL that 
provides the full database’s list of changes, but many consumers can deal quite 
well with a set of feeds to represent a database. Messaging systems like Kafka 
can be taught that a particular source is sharded, and each of the individual 
feeds has its own ordered sequence. Consuming feeds this way makes the dreaded 
“rewind” that can occur when a node fails a more obvious event, and an easier 
one for the external system to handle.

Over the years we’ve grown to be quite careful about keeping enough metadata to 
ensure that we can uniquely identify a particular commit in the database. The 
full list is:

- shard range
- file creation UUID
- sequence number
- node that owned the file for this “epoch” (a range of sequence numbers)

The file creation UUID is important because a shard file might be destroyed on 
a node (hardware failure, operator error, etc.) and then rebuilt via internal 
replication, in which case the sequence order is not preserved. A slightly less 
robust alternative is the file creation timestamp embedded in the filename. The 
“epoch node” is important because operators have been known to copy database 
shard files around the cluster from time to time, and it’s thus possible for 
two different copies of a shard to actually share the not-so-unique UUID for a 
range of their sequences.

I think that a shard-level _changes feed would do well to surface these bits of 
metadata directly, perhaps as properties in the JSON row. If we were concerned 
about bandwidth we could emit those properties only in rows where they change 
relative to the previous sequence.
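
As a purely hypothetical illustration of the shape, shown here as a Python 
literal (none of these property names are settled):

```python
# Hypothetical shard-level _changes row; every extra property here is an
# assumption for illustration, not a proposed API.
row = {
    "seq": 1042,                                  # plain per-shard sequence number
    "id": "mydoc",
    "changes": [{"rev": "3-917fa23"}],
    "range": "80000000-9fffffff",                 # shard range
    "uuid": "af1e0c2d9a1b4c0e8e2f3b5d7c9a1e42",   # file creation UUID
    "epoch_node": "dbcore@db1.example.com",       # node that owned the file for this epoch
}
# To save bandwidth, the last three properties could be emitted only on rows
# where they differ from the preceding sequence.
```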

## Identify the document revision(s) actually associated with each sequence in 
the feed, and include their bodies and revision histories

I had to go spelunking in the codebase to check on the behavior here, and I was 
unpleasantly surprised to find that the _changes feed always includes the 
“winning” revision as the main revision in the row, *not* the revision(s) 
actually recorded into the database at this sequence. The leaves of the 
revision tree do record the sequence at which they were committed, so we do 
have this information.

The reason we want to identify the revisions committed at a given sequence is 
that those are the ones whose document bodies we are most likely to need to 
replicate. The probability is so high that I think we should proactively 
include those bodies in the _changes feed. We can save quite a few lookups this 
way.

I don’t want to be an alarmist here — 99% of the time the revision committed at 
a particular sequence *is* the winning revision. But it’s not required to be 
the case (an edit that extends a losing conflict branch still advances the 
sequence, yet the row reports the winning leaf from the other branch), and for 
replication purposes the winner is not the best version to include.

Of course, there are other consumers of the _changes feed besides replication 
who may not be at all interested in edits that are not on the winning branch, 
so we likely need a flag that controls this behavior. An analytics system, for 
example, could use the high-performance shard-level feeds to grab a full 
snapshot of the database, and would want to receive just the winning revision 
of each document.

Finally, if we know we’re in Scenario A we would want to receive the bodies of 
*all* leaf revisions ASAP, so that’s yet a third option.
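
To sketch the shape of that knob (the shard-level endpoint and the revs_at_seq 
and leaf_bodies parameter names are invented for illustration; include_docs and 
style=all_docs are the existing parameters):

```python
import requests

# Hypothetical shard-level endpoint; the path and the new flags are made up.
shard_url = "http://node1.example.com/mydb/_shard/80000000-9fffffff"

# Scenario B (incremental): the revision(s) committed at each seq, with bodies
requests.get(f"{shard_url}/_changes",
             params={"revs_at_seq": "true", "include_docs": "true"})

# Analytics snapshot: just the winning revision of each doc (today's behavior)
requests.get(f"{shard_url}/_changes",
             params={"revs_at_seq": "false", "include_docs": "true"})

# Scenario A (first-time replication): bodies of *all* leaf revisions up front
requests.get(f"{shard_url}/_changes",
             params={"style": "all_docs", "include_docs": "true",
                     "leaf_bodies": "true"})
```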

## Provide the output of _revs_diff in the response to a new_edits=false 
operation

I think this one is relatively simple. We’re confident that the update(s) which 
just landed in our _changes feed need to be replicated, so we eagerly include 
the body and history in a new_edits=false request. We also include the other 
leaf revisions for the document when we do this, and the server checks to see 
if any of those revisions are missing as well, and includes that information in 
the response. If there are additional revisions required, we have to go back 
and fetch them.
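
To sketch the shape of that exchange (the "leaf_revs" field in the request and 
the "missing" field in the response are invented names, not an actual API):

```python
import requests

def push_change(source, target, doc_id, changed_doc, other_leaves):
    """Sketch only: everything beyond the standard _bulk_docs fields is invented."""
    payload = {
        "docs": [changed_doc],                 # body plus _revisions for the rev at this seq
        "new_edits": False,
        "leaf_revs": {doc_id: other_leaves},   # hypothetical: other leaves we know about
    }
    resp = requests.post(f"{target}/_bulk_docs", json=payload).json()

    # Hypothetical: the target folds its _revs_diff output into the response,
    # e.g. {"missing": {"mydoc": ["3-bbb"]}}, telling us what to fetch next.
    for missing_rev in resp.get("missing", {}).get(doc_id, []):
        doc = requests.get(f"{source}/{doc_id}",
                           params={"rev": missing_rev, "revs": "true"}).json()
        requests.post(f"{target}/_bulk_docs",
                      json={"docs": [doc], "new_edits": False})
```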

Now, onto the more speculative stuff ...

## Include a configurable amount of revision history

If we’re in Scenario A, we want the whole revision history. In Scenario B, it’s 
often quite wasteful to include the full path (which can be > 16KB for a 
frequently-edited document using the default _revs_limit) when it’s likely that 
just the last few entries will suffice to merge the edit into the target. A 
smart replicator could usually guess which scenario it’s most likely to find 
itself in using some basic heuristics, but I’m curious if others have creative 
ideas of how to handle this.
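
As a sketch of the kind of heuristic I mean (the inputs and the default here 
are assumptions, not a proposal):

```python
def history_to_send(full_path, target_is_new, recent=5):
    """Pick how much of a revision path to ship with an update.

    Sketch only: 'target_is_new' could come from the checkpoint doc or the
    target's doc count, and 'recent' would be a configurable knob.
    """
    if target_is_new:
        # Scenario A: brand-new target, ship the whole revision path
        return full_path
    # Scenario B: incremental replication; the last few ancestors are almost
    # always enough for the target to graft the edit onto its existing tree.
    # full_path is ordered newest-first, as in the _revisions field.
    return full_path[:recent]
```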

## Have the target DB report whether it was able to extend existing revision 
paths when saving with new_edits=false

If we request just a recent portion of the revision history, there will be 
times when we don’t have enough history and we accidentally create a spurious 
edit conflict on the target. Of course, this is exactly what happens if 
_revs_limit is set too low on the source, except in this circumstance we have a 
chance to correct things because we still have additional revision metadata 
sitting on the source that we omitted to conserve bandwidth. If the target 
server can let us know that a document revision could not be grafted onto an 
existing edit branch (and it started with a rev position > 1), then the 
replicator might try again, this time requesting the full revision history from 
the source and eliminating the spurious conflict.
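
In replicator terms the recovery path might look roughly like this (the 
response shape and its "extended" field are hypothetical):

```python
import requests

def push_with_retry(source, target, doc_id, doc, rev):
    """Sketch only: the reply format and the 'extended' signal are invented."""
    resp = requests.post(f"{target}/_bulk_docs",
                         json={"docs": [doc], "new_edits": False}).json()

    # Hypothetical signal from the target: this rev started at a position > 1
    # but could not be grafted onto any existing branch, i.e. trimming the
    # history probably manufactured a spurious conflict.
    could_not_extend = resp.get(doc_id, {}).get("extended") is False
    if could_not_extend and int(rev.split("-", 1)[0]) > 1:
        full = requests.get(f"{source}/{doc_id}",
                            params={"rev": rev, "revs": "true"}).json()
        requests.post(f"{target}/_bulk_docs",
                      json={"docs": [full], "new_edits": False})
```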

… So that’s what I’ve got for now. Hopefully it makes sense. I haven’t tried to 
write down exact APIs yet but would save that for the RFC(s) that might come 
out of the discussion. Cheers,

Adam
