Re: couch sequence numbers and _all_docs_by_seq

Chris Anderson Mon, 04 May 2009 08:54:40 -0700

On Mon, May 4, 2009 at 6:54 AM, eric casteleijn
<[email protected]> wrote:
>>> Ok, but just so we're on the same page, I think Eric was talking about
>>> using the update_seq for that document in the view (i.e., the key for that
>>> document in _all_docs_by_seq), not the server's latest update_seq. I don't
>>> think that usage violates any requirements of the map function.
>
> Yes that is correct.
>
>>> Adam
>>>
>> I'd have to think harder on it. I'm pretty sure that you'd end up with the
>> same documents and different view output depending on the node you were
>> using.
>
> I think so too: At least if that would be the case for the _all_docs_by_seq,
> and I'm pretty sure it is. That is not a problem for my use case, where we
> have only one particular node that will be running the views for which I'd
> want to use the _seq.
>
>> You could almost fix it by transmitting 'dead' update_seq's but that's a
>> long road for the immediate question.
>
> If I understand you correctly, that's partly what I'm doing now as a
> solution to a different problem, but it's really the opposite of elegant: To
> have a handle on when a document was first seen on the 'node of interest' we
> have an update trigger write a sequence number (which approximates the
> one the db was at when the document was first seen there) into a
> 'first_seen' field. Doing the same for a 'last_modified' field would be
> worse, or even impossible, since writing to that field would trigger the
> update trigger again.
>


If I remember your problem correctly, you want to process documents
when they first cross the transom into your cloud. If they are then
replicated out and about and come back again later, you don't need to
reprocess them. Of course anything involving client-changeable data is
not a 100% guarantee, but if you can live with occasionally
reprocessing documents, then you might try something like:

A view of all docs which do not have the cloud_processed property,
plus a process which is always trying to keep that view empty by
processing the docs it lists in whatever manner you need and setting
cloud_processed on each one when it is done.
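
To make that concrete, here is a rough Python sketch in which an
in-memory dict stands in for the database. Only cloud_processed comes
from the description above; unprocessed_view, drain, and the sample
docs are made-up names for illustration. In CouchDB itself the view
would be a JavaScript map function that emits only docs lacking the
property.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def unprocessed_view(db: Dict[str, Doc]) -> List[str]:
    """Stand-in for the view: ids of docs lacking cloud_processed."""
    return sorted(doc_id for doc_id, doc in db.items()
                  if "cloud_processed" not in doc)


def drain(db: Dict[str, Doc], process: Callable[[Doc], None]) -> int:
    """Process every doc the view lists, marking each so it drops out."""
    handled = 0
    for doc_id in unprocessed_view(db):
        doc = db[doc_id]
        process(doc)
        doc["cloud_processed"] = True  # this write removes it from the view
        handled += 1
    return handled


# Hypothetical sample data: "b" was already processed earlier.
db = {
    "a": {"title": "first"},
    "b": {"title": "second", "cloud_processed": True},
    "c": {"title": "third"},
}
seen: List[str] = []
drain(db, lambda doc: seen.append(str(doc["title"])))
```

Note that marking a doc is a write to the doc itself, so it creates a
new revision on the cloud side.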

Of course you'll be relying on clients to trigger "download" (from the
cloud to their local) replication about as often as they trigger
"upload" replication, otherwise your process will start to stack up
docs in a conflict state.

The other solution I think we talked about was maintaining an
independent database in the cloud, which just tracks which
document-ids have been processed. This avoids the conflicts scenario,
and when you think about what it means to the disks, it's about the
same cost as maintaining that view. However, you end up querying it
over and over again for each document you see, instead of just seeing
the relevant docs.
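
A minimal sketch of that second approach, again in Python with an
in-memory set standing in for the tracking database. The names
process_new, processed_ids, and incoming are hypothetical; the point
is only that the docs themselves are never written to, at the price
of one lookup per doc seen.

```python
from typing import Callable, Dict, Iterable, List, Set

Doc = Dict[str, object]


def process_new(docs: Iterable[Doc], processed_ids: Set[str],
                process: Callable[[Doc], None]) -> int:
    """Process docs whose id is not yet tracked; leave the docs untouched."""
    handled = 0
    for doc in docs:
        doc_id = str(doc["_id"])
        if doc_id in processed_ids:  # one lookup per doc seen
            continue
        process(doc)
        processed_ids.add(doc_id)    # recorded in the tracking store only
        handled += 1
    return handled


# Hypothetical sample data: "b" is already tracked, "a" shows up twice.
incoming: List[Doc] = [{"_id": "a"}, {"_id": "b"}, {"_id": "a"}]
processed_ids: Set[str] = {"b"}
handled_ids: List[str] = []

first_pass = process_new(incoming, processed_ids,
                         lambda doc: handled_ids.append(str(doc["_id"])))
second_pass = process_new(incoming, processed_ids,
                          lambda doc: handled_ids.append(str(doc["_id"])))
```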

I'd do whatever possible to avoid recording update_seq at the
application level, as CouchDB is not designed to make multi-node
guarantees about that property.
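
A toy model of why, in Python. This is not CouchDB's implementation,
just an illustration under the assumption that each node hands out
its own local counter as writes arrive; Node, write, and apply_all
are invented names. Two nodes that end up with the same docs, but
received them in a different order, disagree on the sequence number
of each doc.

```python
from typing import Dict, List


class Node:
    """Toy node: each write gets the next local sequence number."""

    def __init__(self) -> None:
        self.update_seq = 0
        self.seq_by_id: Dict[str, int] = {}

    def write(self, doc_id: str) -> None:
        self.update_seq += 1
        self.seq_by_id[doc_id] = self.update_seq


def apply_all(order: List[str]) -> Node:
    """Build a node by applying writes in the given arrival order."""
    node = Node()
    for doc_id in order:
        node.write(doc_id)
    return node


node_a = apply_all(["x", "y", "z"])
node_b = apply_all(["z", "x", "y"])  # same docs, different arrival order
```

Both nodes hold x, y, and z, yet any view keyed on the per-doc
sequence number would sort them differently on each node.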

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
