On Dec 12, 2011, at 10:10 PM, Jason Smith wrote:

> On Tue, Dec 13, 2011 at 8:40 AM, Paul Davis <[email protected]>
> wrote:
>>> If there were a hypothetical single query which let the receiver
>>> assess its exact relationship to an arbitrary sender's data, I don't
>>> think "starts over" would sound as awful.
>>>
>>
>> I agree whole heartedly. And the easiest way I see to making that
>> happen is to decouple the host and db identities in such a way that
>> this is a reality. Its possible there's something elegant we could
>> pull from things like merkle trees. I've spent time considering it and
>> haven't thought of anything but I'd be tickled pink if there were a
>> reasonable solution there.
>
> Yeah. That is why I keep thinking of a checksum that works well with
> incremental map/reduce. I always recall that CRC32 is a commutative,
> associative checksum algorithm. It could hypothetically give you a
> checksum of the entire tree, and all subtrees down to the leaves, as a
> Couch reduce function. So the idea is to reduce the by_seq index. You
> get checksums of the database or subsets free or cheap.
>
> At this point I am out of my expertise though so I defer.
>
> --
> Iris Couch

Yep, that's a Merkle tree, and brings us back to where this thread sat
24 hours ago. Couple of points:

* You want to stuff the checksums in the id_tree, not the seq_tree. If
  you use the seq_tree you'll never be able to apply updates that get
  the checksums aligned.

* Merkle trees are great for two-way synchronization, but it's not
  immediately clear to me how you'd use them to bootstrap a single
  source -> target replication. I might just be missing a
  straightforward extension of the tech here.

Adam
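
To make the id_tree vs. seq_tree point concrete, here is a rough Python sketch. It is a toy model, not CouchDB's actual btree code: the names leaf_hash, merkle_root, root_by_id and root_by_seq are made up for illustration, SHA-256 stands in for whatever checksum would really be used, and the tree is a plain pairwise fold over sorted leaves. The assumption is that two replicas hold the same documents at the same revisions but received them in a different order, so their local sequence numbers differ.

```python
import hashlib


def leaf_hash(key, rev):
    """Hash one (key, revision) entry. Hypothetical leaf format."""
    return hashlib.sha256(f"{key}:{rev}".encode()).digest()


def merkle_root(leaves):
    """Fold a list of leaf hashes pairwise up to a single root hash."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        # Hash each adjacent pair; an odd trailing node is hashed alone.
        level = [
            hashlib.sha256(b"".join(level[i:i + 2])).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def root_by_id(updates):
    """Root over leaves keyed by doc id (toy stand-in for the id_tree)."""
    latest = dict(updates)  # last revision seen per doc id
    return merkle_root([leaf_hash(i, r) for i, r in sorted(latest.items())])


def root_by_seq(updates):
    """Root over leaves keyed by local update sequence (toy seq_tree)."""
    latest = {}
    for seq, (doc_id, rev) in enumerate(updates, start=1):
        latest[doc_id] = (seq, rev)  # a doc keeps only its newest seq
    entries = sorted((seq, d, rev) for d, (seq, rev) in latest.items())
    return merkle_root([leaf_hash(f"{seq}:{d}", rev) for seq, d, rev in entries])


# Same documents and revisions, different arrival order on each replica.
replica_a = [("a", "1-x"), ("b", "1-y"), ("c", "1-z")]
replica_b = [("c", "1-z"), ("a", "1-x"), ("b", "1-y")]

same_by_id = root_by_id(replica_a) == root_by_id(replica_b)
same_by_seq = root_by_seq(replica_a) == root_by_seq(replica_b)
```

In this model the by-id roots agree because the leaves depend only on document content, while the by-seq roots disagree because each replica assigns its own sequence numbers. Applying further updates only hands out more local sequence numbers, so the by-seq checksums have no way to converge.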
