Re: Unique instance IDs?

Adam Kocoloski Sun, 11 Dec 2011 19:32:21 -0800

On Dec 11, 2011, at 10:21 PM, Jason Smith wrote:

> On Mon, Dec 12, 2011 at 9:52 AM, Paul Davis <[email protected]> wrote:
>> On Sun, Dec 11, 2011 at 7:19 PM, Randall Leeds <[email protected]> wrote:
>>> On Sun, Dec 11, 2011 at 04:00, Alex Besogonov <[email protected]> wrote:
>>>> I wonder why there are no unique instance IDs in CouchDB. I'm
>>>> thinking about a 'the central server replicates 2000000 documents to
>>>> a million clients' scenario.
>>>> 
>>>> Right now it's not possible to make replication on the 'big central
>>>> server' side stateless, because the other side tries to write a
>>>> replication document which is later used to establish common ancestry.
>>>> The server can ignore/discard it, but then during the next replication
>>>> the client would just have to replicate all the changes again. Of
>>>> course, the results would be consistent in any case, but quite a lot
>>>> of additional traffic might be required.
>>>> 
>>>> It should be simple to assign each instance a unique ID (computed
>>>> using a UUID and the set of applied replication filters) and use it to
>>>> establish common replication history. It can even be compatible with
>>>> the way the current replication system works, and basically the only
>>>> visible change should be the addition of a UUID to the database info.
>>>> 
>>>> Or am I missing something?
>>> 
>>> I proposed UUIDs for databases a long, long time ago and it's come up
>>> a few times since. If the UUID is database-level, then storing it with
>>> the database is dangerous -- copying a database file would result in
>>> two CouchDBs hosting "the same" (but really different) databases. If
>>> the UUID is host-level, then this reduces to a re-invention of DNS. In
>>> other words, all DBs should already be uniquely identified by their
>>> URLs.
>>> 
>>> Regarding your second paragraph, replicating couches _could_ try to
>>> establish common ancestry only by examining a local checkpoint of
>>> replication, but the couch replicator looks for the log on both
>>> couches to ensure that the database hasn't been deleted+recreated nor
>>> has it crashed before certain replicated changes hit disk, as a double
>>> check that the sequence numbers have the expected shared meaning.
>>> 
>>> It seems like maybe you're wondering about whether couch could
>>> generate snapshot ids that are more meaningful than the sequence
>>> number. For a single pair of couches the host-db-seq combo is enough
>>> information to replicate effectively. When there are more hosts involved
>>> we can talk about more powerful checkpoint ids that would be shareable
>>> or resolvable to find common ancestry between more than two
>>> replicating hosts to speed up those scenarios. My intuition always
>>> says that this leads to hash trees, but I haven't thought about it
>>> deeply enough to fully conceive of what this accomplishes or how it
>>> would work.
>>> 
>>> -R
>> 
>> I did have a shimmering of an idea for this a while back. Basically we
>> do both host and db UUIDs, and the information we use to identify
>> replications is a hash of the concatenation.
>> 
>> That way we can copy DBs around and not muck with things as well as
>> error out a bit. Though this still has a bit of an issue if we copy
>> the host UUID around as well. Though we might be able to look for a
>> MAC address or something and then fail to boot if the check fails
>> (with an optional override if someone changes a NIC).
> 
> A couch URL is its unique identifier. A database URL is its unique
> identifier. This sounds like a too-clever-by-half optimization. IMHO.
> 
> -- 
> Iris Couch


Clever or not, transitive replication checkpoints would be a pretty significant 
optimization.  I don't think it has to end up in the land of hash trees, though 
I'll grant that those structures are a fine tool to quickly identify 
discrepancies between any two databases (and they map nicely to couch_btree to 
boot).
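
To make the hash-tree remark concrete, here is a rough sketch in Python of how two databases could be compared without walking every document. Nothing in it is CouchDB code: build_tree and differing_buckets are made-up names, and bucketing documents by a SHA-256 prefix of the doc id is purely an assumption for illustration (a real version could presumably hang the hashes off couch_btree nodes instead).

```python
import hashlib


def _h(*parts: bytes) -> bytes:
    """Hash a sequence of byte strings, length-prefixed to avoid ambiguity."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


def build_tree(docs: dict[str, str], depth: int = 2) -> dict[str, bytes]:
    """Map hex prefixes of length 0..depth to a hash of the docs beneath them.

    `docs` maps doc id -> current revision (hypothetical input shape).
    The empty prefix "" is the root; an empty database has no nodes.
    """
    leaves: dict[str, list[bytes]] = {}
    for doc_id, rev in docs.items():
        prefix = hashlib.sha256(doc_id.encode()).hexdigest()[:depth]
        leaves.setdefault(prefix, []).append(_h(doc_id.encode(), rev.encode()))
    tree = {prefix: _h(*sorted(hashes)) for prefix, hashes in leaves.items()}

    # Fold each level into its parent until only the root is left.
    for level in range(depth - 1, -1, -1):
        children: dict[str, list[bytes]] = {}
        for prefix, node_hash in list(tree.items()):
            if len(prefix) == level + 1:
                children.setdefault(prefix[:level], []).append(
                    prefix.encode() + node_hash
                )
        for prefix, parts in children.items():
            tree[prefix] = _h(*sorted(parts))
    return tree


def differing_buckets(a: dict[str, bytes], b: dict[str, bytes],
                      depth: int = 2, prefix: str = "") -> list[str]:
    """Return the leaf prefixes whose contents differ between two trees."""
    if a.get(prefix) == b.get(prefix):
        return []  # identical subtree (or absent on both sides): skip it
    if len(prefix) == depth:
        return [prefix]
    found: list[str] = []
    for child in "0123456789abcdef":
        found.extend(differing_buckets(a, b, depth, prefix + child))
    return found
```

If the two roots match, the databases agree and nothing else is compared; otherwise only the buckets that come back from differing_buckets need a document-by-document look.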

Adam
