On Sun, Dec 11, 2011 at 7:19 PM, Randall Leeds <[email protected]> wrote:
> On Sun, Dec 11, 2011 at 04:00, Alex Besogonov <[email protected]>
> wrote:
>> I wonder, why there are no unique instance IDs in CouchDB? I'm
>> thinking about 'the central server replicates 2000000 documents to a
>> million of clients' scenario.
>>
>> Right now it's not possible to make replication on the 'big central
>> server' side to be stateless, because the other side tries to write
>> replication document which is later used to establish common ancestry.
>> Server can ignore/discard it, but then during the next replication
>> client would just have to replicate all the changes again. Of course,
>> the results would be consistent in any case but quite a lot of
>> additional traffic might be required.
>>
>> It should be simple to assign each instance a unique ID (computed
>> using UUID and the set of applied replication filters) and use it to
>> establish common replication history. It can even be compatible with
>> the way the current replication system works and basically the only
>> visible change should be the addition of UUID to database info.
>>
>> Or am I missing something?
>
> I proposed UUIDs for databases a long, long time ago and it's come up
> a few times since. If the UUID is database-level, then storing it with
> the database is dangerous -- copying a database file would result in
> two CouchDB's hosting "the same" (but really different) databases. If
> the UUID is host-level, then this reduces to a re-invention of DNS. In
> other words, all DBs should already be uniquely identified by their
> URLs.
>
> Regarding your second paragraph, replicating couches _could_ try to
> establish common ancestry only by examining a local checkpoint of
> replication, but the couch replicator looks for the log on both
> couches to ensure that the database hasn't been deleted+recreated nor
> has it crashed before certain replicated changes hit disk, as a double
> check that the sequence numbers have the expected shared meaning.
>
> It seems like maybe you're wondering about whether couch could
> generate snapshot ids that are more meaningful than the sequence
> number. For a single pair of couches the host-db-seq combo is enough
> information to replicate effectively. When there's more hosts involved
> we can talk about more powerful checkpoint ids that would be shareable
> or resolvable to find common ancestry between more than two
> replicating hosts to speed up those scenarios. My intuition always
> says that this leads to hash trees, but I haven't thought about it
> deeply enough to fully conceive of what this accomplishes or how it
> would work.
>
> -R

I did have a glimmer of an idea for this a while back. Basically, we use both host and db UUIDs, and the information we use to identify replications is a hash of their concatenation. That way we can copy dbs around without mucking things up, and also error out in some cases. This still has a bit of an issue if the host UUID gets copied around as well, though we might be able to look for a MAC address or something and then fail to boot if the check fails (with an optional override if someone changes a NIC).
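
To make that a bit more concrete, here is a rough Python sketch of the idea, not anything CouchDB actually does. The names `replication_id` and `host_check` are made up for illustration, and both SHA-256 and the ":" separator are my own assumptions; any stable hash and unambiguous encoding would do.

```python
import hashlib
import uuid


def replication_id(host_uuid: str, db_uuid: str) -> str:
    """Hypothetical helper: hash the concatenation of host and db UUIDs."""
    # The ":" separator (an assumption) keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{host_uuid}:{db_uuid}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def host_check(stored_mac: int, current_mac: int, override: bool = False) -> bool:
    """Hypothetical boot-time check: does the host UUID still belong to this machine?

    Returns True if startup may proceed. The override covers the case
    where someone legitimately replaced the NIC.
    """
    return override or stored_mac == current_mac


if __name__ == "__main__":
    host = str(uuid.uuid4())
    db = str(uuid.uuid4())
    print(replication_id(host, db))

    # uuid.getnode() returns the hardware address as an integer; it may fall
    # back to a random value if no MAC address can be determined.
    mac_at_install = uuid.getnode()
    print(host_check(mac_at_install, uuid.getnode()))
```

Copying a db file to another host keeps the db UUID but changes the host UUID, so the hash changes and the copy looks like a different replication endpoint. Copying the host UUID along with it is the case the MAC check is meant to catch.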
