(I suppose this is a [DISCUSS] - but isn’t that really the default for a *mailing list*? :)

I’ve been having a few discussions with folks about how we might evolve toward a more efficient replication protocol, and wanted to share some thoughts here. Apologies in advance, this one is kinda long.

As a reminder, the current replication protocol goes something like this:

1. Read the _changes feed with the style=all_docs option to get a list of all leaf revisions for each document
2. Ask the target DB which of those revisions are missing using the _revs_diff endpoint (this is a batch operation across multiple docs)
3. If any document has missing revisions on the target, retrieve all of those revisions (and their full history) in a multipart request per doc
4. Save the missing revisions on the target. Documents with attachments are saved individually using the multipart API; otherwise _bulk_docs is used
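For reference, here’s a toy version of that flow in Python against the existing HTTP API. It’s a minimal sketch (the `replicate_once` name is mine): one pass, no batching, no checkpoints, no error handling, and JSON-only documents, so attachments and the multipart APIs are ignored.

```python
# Minimal sketch of the current replication algorithm using CouchDB's
# public HTTP API. A real replicator batches, checkpoints, streams the
# feed, and handles attachments via multipart; this does none of that.
import json
import requests

def replicate_once(source, target, since=0):
    # 1. Read the _changes feed with style=all_docs to get every leaf
    #    revision of each changed document.
    changes = requests.get(
        f"{source}/_changes",
        params={"style": "all_docs", "since": since},
    ).json()
    revs = {
        row["id"]: [ch["rev"] for ch in row["changes"]]
        for row in changes["results"]
    }
    if not revs:
        return changes["last_seq"]

    # 2. Ask the target which of those revisions it is missing
    #    (a single batch request covering many documents).
    diff = requests.post(f"{target}/_revs_diff", json=revs).json()

    # 3. Fetch each missing revision with its full history, one
    #    request per document (open_revs + revs=true).
    docs = []
    for doc_id, info in diff.items():
        rows = requests.get(
            f"{source}/{doc_id}",
            params={
                "open_revs": json.dumps(info["missing"]),
                "revs": "true",
            },
            headers={"Accept": "application/json"},
        ).json()
        docs.extend(row["ok"] for row in rows if "ok" in row)

    # 4. Save the missing revisions on the target verbatim, i.e.
    #    without assigning new revision identifiers.
    requests.post(
        f"{target}/_bulk_docs",
        json={"docs": docs, "new_edits": False},
    )
    return changes["last_seq"]
```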
I would posit that most replication jobs fall into one of two scenarios:

Scenario A: First-time replication to a new database, where we need to transfer *everything* — all revisions, and full paths for each.

Scenario B: Incremental replication, where we likely only need to transfer the contents of the specific revision located at that seq in the feed, and a partial path is likely enough to merge the revision correctly into the target document.

Our current protocol doesn’t handle either of those scenarios optimally, partly because it also wants to be efficient for the case where two peers have some overlapping set of edits that they’ve acquired by independent means and want to avoid unnecessarily transferring document bodies. In practice I don’t think that happens very often.

Here are some ideas for improvements aimed at making those two scenarios more efficient. The summarized list I have in mind includes three “safe” optimizations:

- Implement a shard-level _changes interface
- Identify the document revision(s) actually associated with each sequence in the feed, and include their bodies and revision histories
- Provide the output of _revs_diff in the response to a new_edits=false operation

and a somewhat more exotic attempt at saving bandwidth for replication of frequently-edited documents:

- Include a configurable amount of revision history
- Have the target DB report whether it was able to extend existing revision paths when saving with new_edits=false

I’ll go into more detail on each of those below.

## Implement a shard-level _changes interface

This one is applicable to both scenarios. The source cluster spends a lot of CPU cycles merging the feeds from individual shards and (especially) computing a cluster-wide sequence. This sequence obscures the actual shard that contributed the row in question, doesn’t sort very cleanly, uses a boatload of bandwidth, and cannot be easily decoded by consumers. In general, yuck.

Some applications will continue to benefit from having a single URL that provides the full database’s list of changes, but many consumers can deal quite well with a set of feeds representing a database. Messaging systems like Kafka can be taught that a particular source is sharded, and each of the individual feeds has its own ordered sequence. Consuming feeds this way makes the dreaded “rewind” that can occur when a node fails a more obvious event, and an easier one for the external system to handle.

Over the years we’ve grown to be quite careful about keeping enough metadata to ensure that we can uniquely identify a particular commit in the database. The full list is:

- shard range
- file creation UUID
- sequence number
- node that owned the file for this “epoch” (a range of sequence numbers)

The file creation UUID is important because a shard file might be destroyed on a node (hardware failure, operator error, etc.) and then rebuilt via internal replication, in which case the sequence order is not preserved. A slightly less robust alternative is the file creation timestamp embedded in the filename. The “epoch node” is important because operators have been known to copy database shard files around the cluster from time to time, and it’s thus possible for two different copies of a shard to share the not-so-unique UUID for a range of their sequences.

I think a shard-level _changes feed would do well to surface these bits of metadata directly, perhaps as properties in the JSON row. If we were concerned about bandwidth we could emit those properties only in rows where they changed compared to the previous sequence.
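To make that concrete, here’s one purely hypothetical shape for a shard-level row. None of these property names exist today; they’re placeholders for the discussion:

```python
# Purely illustrative: one possible shape for a shard-level _changes
# row carrying the identifying metadata listed above. The "range",
# "uuid", and "epoch_node" properties are invented for this sketch.
shard_row = {
    "seq": 5861,                              # per-shard sequence number
    "id": "mydoc",
    "changes": [{"rev": "3-7379b9e515"}],
    "range": "00000000-1fffffff",             # shard range
    "uuid": "3b94b8a6c5c8f2a1...",            # file creation UUID
    "epoch_node": "couchdb@db1.example.com",  # owner of this epoch
}
```

Eliding the last three properties whenever they match the previous row would keep the bandwidth overhead of this metadata negligible.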
## Identify the document revision(s) actually associated with each sequence in the feed, and include their bodies and revision histories

I had to go spelunking in the codebase to check on the behavior here, and I was unpleasantly surprised to find that the _changes feed always includes the “winning” revision as the main revision in the row, *not* the revision(s) actually recorded in the database at this sequence. The leaves of the revision tree do record the sequence at which they were committed, so we do have this information.

The reason we want to identify the revisions committed at this sequence is that those are the ones where we have the highest probability of needing to replicate the document body. The probability is so high that I think we should be proactively including them in the _changes feed. We can save quite a few lookups this way.

I don’t want to be an alarmist here — 99% of the time the revision committed at a particular sequence *is* the winning revision. But it’s not required to be the case, and for replication purposes the winner is not the best revision to include.

Of course, there are other consumers of the _changes feed besides replication who may not be at all interested in edits that are not on the winning branch, so we likely need a flag that controls this behavior. An analytics system, for example, could use the high-performance shard-level feeds to grab a full snapshot of the database, and would want to receive just the winning revision of each document. Finally, if we know we’re in Scenario A we would want to receive the bodies of *all* leaf revisions ASAP, so that’s yet a third option.

## Provide the output of _revs_diff in the response to a new_edits=false operation

I think this one is relatively simple. We’re confident that the update(s) which just landed in our _changes feed need to be replicated, so we eagerly include the body and history in a new_edits=false request. We also include the other leaf revisions for the document when we do this, and the server checks to see if any of those revisions are missing as well, and includes that information in the response. If there are additional revisions required, we have to go back and fetch them.
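A hypothetical sketch of that round trip, just for the sake of discussion. The `leaf_revs` request field and the `missing` report in the reply are invented here; neither is part of today’s _bulk_docs API, and the reply is assumed to be an object rather than today’s array:

```python
# Hypothetical sketch of the proposed eager push. We send the body and
# _revisions history we already have from the _changes row, name the
# document's other known leaf revisions, and the target tells us which
# of those it is still missing. "leaf_revs" and "missing" are invented
# placeholders for the proposed API, not existing fields.
import requests

def push_change(target, doc, other_leaf_revs):
    reply = requests.post(
        f"{target}/_bulk_docs",
        json={
            "docs": [doc],                               # body + history
            "new_edits": False,
            "leaf_revs": {doc["_id"]: other_leaf_revs},  # hypothetical
        },
    ).json()
    # Proposed: the reply embeds a _revs_diff-style report, so a second
    # round trip happens only when the target is genuinely missing
    # additional revisions.
    return reply.get("missing", {})
```

If `missing` comes back non-empty, the replicator falls back to the fetch-and-save path from the sketch at the top of this mail.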
Now, onto the more speculative stuff ...

## Include a configurable amount of revision history

If we’re in Scenario A, we want the whole revision history. In Scenario B, it’s often quite wasteful to include the full path (which can be > 16KB for a frequently-edited document using the default _revs_limit) when it’s likely that just the last few entries will suffice to merge the edit into the target. A smart replicator could usually guess which scenario it’s most likely to find itself in using some basic heuristics, but I’m curious whether others have creative ideas about how to handle this.

## Have the target DB report whether it was able to extend existing revision paths when saving with new_edits=false

If we request just a recent portion of the revision history, there will be times when we don’t have enough history and we accidentally create a spurious edit conflict on the target. Of course, this is exactly what happens if _revs_limit is set too low on the source, except in this circumstance we have a chance to correct things, because we still have additional revision metadata sitting on the source that we omitted to conserve bandwidth. If the target server can let us know that a document revision could not be grafted onto an existing edit branch (and it started with a rev position > 1), then the replicator might try again, this time requesting the full revision history from the source and eliminating the spurious conflict.

…

So that’s what I’ve got for now. Hopefully it makes sense. I haven’t tried to write down exact APIs yet, but would save that for the RFC(s) that might come out of this discussion.

Cheers, Adam