If I remember your problem correctly, you want to process documents
when they first cross the transom into your cloud. If they are then
replicated out and about and come back again later, you don't need to
reprocess them. Of course anything involving client-changeable data is
not a 100% guarantee, but if you can live with occasionally
reprocessing documents, then you might try something like:
a view of all docs which do not have the cloud_processed property, and
then a process which is always trying to keep that view empty by
processing the docs it lists in whatever manner you need.
This is exactly what we do now: there is an update trigger that queries
for unprocessed documents and bulk-updates them to have a 'first_seen'
field that holds the sequence number the database was at when the
trigger fired.
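To make that concrete, here is a minimal sketch of the trigger logic in Python. It runs against an in-memory stand-in rather than a real database; FakeDB, unprocessed and on_update are hypothetical names for illustration, not our actual code or any CouchDB API.

```python
class FakeDB:
    """Tiny in-memory stand-in for a database with an update sequence."""

    def __init__(self):
        self.docs = {}
        self.update_seq = 0

    def save(self, doc):
        # Every write bumps the database-wide sequence number.
        self.update_seq += 1
        self.docs[doc["_id"]] = dict(doc)

    def bulk_save(self, docs):
        for doc in docs:
            self.save(doc)


def unprocessed(db):
    """Emulate the view: copies of all docs without a 'first_seen' field."""
    return [dict(d) for _, d in sorted(db.docs.items()) if "first_seen" not in d]


def on_update(db):
    """Stamp every unprocessed doc with the sequence number at trigger time."""
    seq = db.update_seq
    pending = unprocessed(db)
    for doc in pending:
        doc["first_seen"] = seq
    db.bulk_save(pending)
    return [d["_id"] for d in pending]


db = FakeDB()
db.save({"_id": "a"})
db.save({"_id": "b"})
stamped = on_update(db)   # both docs get first_seen == 2
again = on_update(db)     # nothing left to stamp
```

Note that the bulk update itself advances the sequence number, so 'first_seen' records when the trigger ran, not the sequence of the write that created the document.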
The reason I would like to have access to the sequence number of
documents in my views is similar but different: it would allow me to
write a view that gets all the documents of a particular type that were
last updated between two sequence numbers, without relying on an id
prefix. An id prefix feels awkward and is problematic for us: our
document types are URIs, which obviously won't work as part of the id,
so we'd have to keep a mapping from type to prefix, and that is another
step away from simplicity.
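As a rough illustration of the query I have in mind, the sketch below emulates a view keyed on [type, seq] with a sorted list in Python. This is hypothetical: views cannot see the sequence number today, and the type URIs, doc ids and the changed_between helper are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical view rows, kept sorted by key: (doc_type, last_update_seq, doc_id).
rows = sorted([
    ("http://example.com/types/note", 3, "n1"),
    ("http://example.com/types/note", 7, "n2"),
    ("http://example.com/types/contact", 5, "c1"),
    ("http://example.com/types/note", 12, "n3"),
])


def changed_between(rows, doc_type, start_seq, end_seq):
    """Return ids of docs of one type with start_seq <= seq <= end_seq.

    Mirrors a startkey/endkey range query on a [type, seq] key, so only
    the matching slice of the index is touched.
    """
    lo = bisect_left(rows, (doc_type, start_seq, ""))
    hi = bisect_right(rows, (doc_type, end_seq, chr(0x10FFFF)))
    return [doc_id for _, _, doc_id in rows[lo:hi]]


notes = changed_between(rows, "http://example.com/types/note", 4, 12)
```

The type URI is used directly as part of the key, so no type-to-prefix mapping is needed.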
Of course you'll be relying on clients to trigger "download" (from the
cloud to their local) replication about as often as they trigger
"upload" replication; otherwise your process will start to stack up
docs in a conflict state.
The other solution I think we talked about was maintaining an
independent database in the cloud, which just tracks which
document-ids have been processed. This avoids the conflict scenario,
and when you think about what it means to the disks, it's about the
same cost as maintaining that view. However, you end up querying it
over and over again for each document you see, instead of just seeing
the relevant docs.
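A minimal sketch of that tracking approach, in Python, with a plain set standing in for the separate database. The names processed_ids and handle are hypothetical; the point is only that every document seen costs one lookup, whether or not it needs processing.

```python
processed_ids = set()  # stand-in for the separate tracking database
lookups = 0            # how many times the tracking store was queried


def handle(doc_id):
    """Return True if the doc was processed now, False if already seen."""
    global lookups
    lookups += 1  # one query per document seen, processed or not
    if doc_id in processed_ids:
        return False
    processed_ids.add(doc_id)  # record it so later replications skip it
    return True


# Documents arriving via replication, some of them more than once.
incoming = ["a", "b", "a", "c", "b"]
results = [handle(doc_id) for doc_id in incoming]
```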
That is still a solution we might have to choose, but even if it's not a
performance problem, it increases code complexity.
I'd do whatever possible to avoid recording update_seq at the
application level, as CouchDB is not designed to make multi-node
guarantees about that property.
Yes, that way madness lies, and I'm not suggesting that. All I'd like is
for views I create myself to be able to use _seq as a key (or possibly
value), as _all_docs_by_seq does. That would give more efficient ways of
querying for data that changed either within the node or through
replication, i.e. with a single view, rather than by calling
_all_docs_by_seq and filtering in application code.
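For comparison, filtering in application code looks roughly like the Python sketch below. The rows are a simplified, made-up stand-in for what a by-sequence listing returns once the documents have been fetched; filter_by_seq is a hypothetical helper.

```python
# Simplified stand-in for rows in sequence order, with the doc type attached.
by_seq = [
    {"seq": 3, "id": "n1", "type": "note"},
    {"seq": 5, "id": "c1", "type": "contact"},
    {"seq": 7, "id": "n2", "type": "note"},
    {"seq": 9, "id": "c2", "type": "contact"},
    {"seq": 12, "id": "n3", "type": "note"},
]


def filter_by_seq(rows, doc_type, start_seq, end_seq):
    """Scan the sequence range and keep only docs of the wanted type."""
    matches = []
    scanned = 0
    for row in rows:
        if row["seq"] < start_seq:
            continue
        if row["seq"] > end_seq:
            break
        scanned += 1  # every row in the range is fetched, wanted or not
        if row["type"] == doc_type:
            matches.append(row["id"])
    return matches, scanned


matches, scanned = filter_by_seq(by_seq, "note", 4, 12)
```

Every row in the sequence range has to be transferred and inspected, including documents of other types that are then thrown away.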
--
- eric casteleijn
http://www.canonical.com