On Mon, Dec 12, 2011 at 10:26 PM, Paul Davis <[email protected]> wrote:
>> * Merkle trees are great for two-way synchronization, but it's not
>> immediately clear to me how you'd use them to bootstrap a single source ->
>> target replication. I might just be missing a straightforward extension of
>> the tech here.
> This is the point that's important with checksums and so on. Merkle
> trees are great when you want to mirror structured data but CouchDB
> replication is a mirror operation. Think, N db's replicating to a
> central DB. you have a mixture of things which breaks checksums (or at
> least any obvious application I can think of given our internal
> structures)

Uhm. What are the things that break checksums? Right now revision IDs are
_almost_ deterministic, and it's not that hard to make them completely
deterministic. And for replication purposes nothing else matters.
To be exact: the only entity used for ID generation is the
'[Deleted, OldStart, OldRev, Body, Atts2]' tuple, and only the 'Atts2'
field can be non-deterministic. That can be fixed (along with other minor
forward-looking features like explicit versioning). Then it's easy to
devise a protocol to replicate based on hash trees.

I'm thinking about this protocol:

1) The current state of a replicated database is identified by a hash.
Suppose we have unidirectional replication A->B. Let's denote the state
of the initial database A as A1 and B's as B1. We store the ancestry as a
list of hashes outside the database (so it doesn't influence the hash of
the database).

2) As the first step, B sends its list of replication ancestry. It's
actually not even required to send the whole hashes each time: sending
just the first 4 bytes of each hash means even 1 million records of
replication history would take only 4 MB. Server A then replies with its
own set of hashes that have matching initial bytes. If there are none,
the client falls back to the usual replication. So at this step B knows
the most recent common ancestor and requests the changes that have
happened since that point in time. Each changeset, naturally, has its
own hash.

3) After these changes are applied and merged, B's state is the A1 state
plus all the changes B itself might have made ever since. B then stores
the hashes of the changesets that have been applied.

That's it. Should work, as far as I can see (it's 3am local time, so I
might be missing something). Overhead: 16 bytes of hash information per
changeset.
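Step 2 above can be sketched in a few lines. This is a hypothetical
illustration, not CouchDB code: the function names and the choice of MD5
for the ancestry hashes are my own assumptions. Note that a real protocol
would confirm a prefix match against the full hashes returned by A, since
4-byte prefixes can collide.

```python
import hashlib

PREFIX_LEN = 4  # bytes of each ancestry hash actually sent over the wire


def make_prefixes(ancestry):
    """What B sends: the first PREFIX_LEN bytes of each ancestry hash."""
    return [h[:PREFIX_LEN] for h in ancestry]


def latest_common_ancestor(source_ancestry, target_prefixes):
    """On the source (A): find the most recent ancestry hash whose
    prefix also appears in the target's (B's) prefix list.
    Returns the full hash, or None, meaning B must fall back to
    ordinary replication."""
    wanted = set(target_prefixes)
    for full_hash in reversed(source_ancestry):  # newest first
        if full_hash[:PREFIX_LEN] in wanted:
            return full_hash
    return None


# Demo: A and B share three changesets of history; A has one extra.
shared = [hashlib.md5(b"change-%d" % i).digest() for i in range(3)]
a_ancestry = shared + [hashlib.md5(b"change-3").digest()]
b_ancestry = list(shared)

ancestor = latest_common_ancestor(a_ancestry, make_prefixes(b_ancestry))
assert ancestor == shared[-1]  # most recent common point of history
```

With MD5 the full ancestry record is 16 bytes per changeset, which is
where the overhead figure above comes from; only 4 of those bytes travel
in the initial exchange.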

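For the determinism point itself: CouchDB derives the revision ID from
the '[Deleted, OldStart, OldRev, Body, Atts2]' tuple internally. As a
rough analogue (in Python rather than Erlang, with a canonical JSON
serialization standing in for CouchDB's actual term encoding, both of
which are my assumptions), a fully deterministic revision hash could look
like this:

```python
import hashlib
import json


def deterministic_rev_hash(deleted, old_start, old_rev, body, atts):
    """Hypothetical sketch: hash the [Deleted, OldStart, OldRev, Body,
    Atts2] tuple via a canonical serialization (sorted keys, fixed
    separators) so the same logical revision always yields the same
    16-byte digest, regardless of key order or formatting."""
    canonical = json.dumps(
        [deleted, old_start, old_rev, body, atts],
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.md5(canonical).digest()
```

The point of the canonical serialization is that two replicas producing
the same logical edit independently arrive at the same revision ID, which
is what makes the hash-based ancestry comparison above sound.

```python
# Same body, different dict insertion order -> same revision hash.
h1 = deterministic_rev_hash(False, 1, "abc", {"b": 2, "a": 1}, [])
h2 = deterministic_rev_hash(False, 1, "abc", {"a": 1, "b": 2}, [])
assert h1 == h2 and len(h1) == 16
```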