Re: couch sequence numbers and _all_docs_by_seq

eric casteleijn Wed, 06 May 2009 01:23:39 -0700

If I remember your problem correctly, you want to process documents
when they first cross the transom into your cloud. If they are then
replicated out and about and come back again later, you don't need to
reprocess them. Of course anything involving client-changeable data is
not a 100% guarantee, but if you can live with occasionally
reprocessing documents, then you might try something like:


A view of all docs which do not have the cloud_processed property. And
then a process which is always trying to keep that view empty by
processing the docs it lists in whatever manner you need.

This is exactly what we do now: there is an update trigger that queries
for unprocessed documents and bulk updates them to have a 'first_seen'
field that holds the sequence number that the database was at when the
trigger was fired.
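
To make the trigger described above concrete, here is a rough Python
sketch of the stamping step only. The function name and the sample
documents are made up for illustration, and it does no HTTP at all: it
just builds the kind of body one could POST to _bulk_docs, assuming the
caller has already fetched the unprocessed documents and the database's
current update_seq.

```python
def build_first_seen_updates(docs, update_seq):
    """Build a _bulk_docs-style body that stamps unprocessed docs.

    `docs` is a list of document dicts (hypothetical sample data here);
    `update_seq` is the sequence number the database reported when the
    trigger fired. Documents that already carry 'first_seen' are skipped.
    """
    updates = []
    for doc in docs:
        if "first_seen" in doc:
            continue  # stamped on an earlier run, leave it alone
        stamped = dict(doc)  # copy, so the caller's data is not mutated
        stamped["first_seen"] = update_seq
        updates.append(stamped)
    # _bulk_docs takes a JSON body of the form {"docs": [...]}
    return {"docs": updates}


# Hypothetical example data, not taken from a real database.
sample_docs = [
    {"_id": "a", "_rev": "1-abc", "type": "note"},
    {"_id": "b", "_rev": "2-def", "type": "note", "first_seen": 7},
]
body = build_first_seen_updates(sample_docs, update_seq=42)
```

Only document "a" ends up in the body, since "b" already has a
'first_seen' value.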

The reason I would like to have access to the sequence number of
documents in my views is similar but different: it would allow me to
write a view that gets all the documents of a particular type that were
last updated between two sequence numbers, without relying on an id
prefix. A prefix feels awkward and is problematic for us: our document
types are URIs, which obviously won't work as part of the id, so we'd
have to keep a mapping from type to prefix, and that is another step
away from simplicity.
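
As an illustration of the query I have in mind, here is a small Python
emulation of a view keyed on [type, seq] with a startkey/endkey range.
Everything in it is hypothetical: views cannot currently see a
document's sequence number, so the "seq" field on the sample documents,
the example type URIs, and the function names are all invented for the
sketch.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Emulate a view that emits [type, seq] as the key for each doc."""
    return sorted(((d["type"], d["seq"]), d["_id"]) for d in docs)


def query_range(index, doc_type, start_seq, end_seq):
    """Return ids of docs of `doc_type` with start_seq <= seq <= end_seq.

    This mirrors a view query with startkey=[type, start_seq] and
    endkey=[type, end_seq], both ends inclusive.
    """
    keys = [key for key, _ in index]
    lo = bisect_left(keys, (doc_type, start_seq))
    hi = bisect_right(keys, (doc_type, end_seq))
    return [doc_id for _, doc_id in index[lo:hi]]


# Hypothetical documents; the type is a URI, as in our setup.
NOTE = "http://example.com/types/note"
CONTACT = "http://example.com/types/contact"
sample = [
    {"_id": "a", "type": NOTE, "seq": 3},
    {"_id": "b", "type": CONTACT, "seq": 5},
    {"_id": "c", "type": NOTE, "seq": 8},
    {"_id": "d", "type": NOTE, "seq": 12},
]
index = build_index(sample)
recent_notes = query_range(index, NOTE, 4, 12)
```

Because the type is part of the key rather than the id, no mapping from
type URI to id prefix is needed.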

Of course you'll be relying on clients to trigger "download" (from the
cloud to their local) replication about as often as they trigger
"upload" replication, otherwise your process will start to stack up
docs in a conflict state.

The other solution I think we talked about was maintaining an
independent database in the cloud, which just tracks which
document-ids have been processed. This avoids the conflicts scenario,
and when you think about what it means to the disks, it's about the
same cost as maintaining that view. However, you end up querying it
over and over again for each document you see, instead of just seeing
the relevant docs.

That is still a solution we might have to choose, but even if it's not a
performance problem, it increases code complexity.

I'd do whatever possible to avoid recording update_seq at the
application level, as CouchDB is not designed to make multi-node
guarantees about that property.

Yes, that way madness lies, and I'm not suggesting that. All I'd like is
for views I create myself to be able to use _seq as a key (or possibly
value) like _all_docs_by_seq does, to have more efficient ways of
querying for data that changed either within the node or through
replication, i.e. with a single view, rather than through calling
_all_docs_by_seq and filtering in application code.
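
For comparison, the application-side filtering I would like to avoid
looks roughly like the Python sketch below. It works on an already
parsed response and makes assumptions that should be checked against
your CouchDB version: that each row carries the sequence number in
"key", and that the request was made with include_docs=true so each row
has a "doc". The function name and the sample response are invented.

```python
def changed_docs_of_type(response, doc_type, since_seq, until_seq):
    """Filter a parsed _all_docs_by_seq response in application code.

    Returns ids of documents of `doc_type` whose sequence number lies in
    the half-open range (since_seq, until_seq].
    """
    matches = []
    for row in response["rows"]:
        seq = row["key"]  # assumed: the row key is the sequence number
        doc = row.get("doc") or {}  # deleted docs may have no body
        if since_seq < seq <= until_seq and doc.get("type") == doc_type:
            matches.append(doc["_id"])
    return matches


# Hypothetical response, shaped like the assumed row layout.
NOTE_TYPE = "http://example.com/types/note"
sample_response = {
    "rows": [
        {"id": "a", "key": 3, "doc": {"_id": "a", "type": NOTE_TYPE}},
        {"id": "b", "key": 5, "doc": {"_id": "b", "type": "other"}},
        {"id": "c", "key": 8, "doc": {"_id": "c", "type": NOTE_TYPE}},
        {"id": "d", "key": 9, "doc": None},
    ]
}
changed = changed_docs_of_type(sample_response, NOTE_TYPE, 3, 9)
```

Every row in the sequence range has to be fetched and inspected, even
though only a few are of the type being asked for.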


--
- eric casteleijn
http://www.canonical.com
