On Mon, May 4, 2009 at 6:54 AM, eric casteleijn <[email protected]> wrote:
>>> Ok, but just so we're on the same page, I think Eric was talking about
>>> using the update_seq for that document in the view (i.e., the key for that
>>> document in _all_docs_by_seq), not the server's latest update_seq. I don't
>>> think that usage violates any requirements of the map function.
>
> Yes that is correct.
>
>>> Adam
>>>
>> I'd have to think harder on it. I'm pretty sure that you'd end up with the
>> same documents and different view output depending on the node you were
>> using.
>
> I think so too: At least if that would be the case for the _all_docs_by_seq,
> and I'm pretty sure it is. That is not a problem for my use case, where we
> have only one particular node that will be running the views for which I'd
> want to use the _seq.
>
>> You could almost fix it by transmitting 'dead' update_seq's but that's a
>> long road for the immediate question.
>
> If I understand you correctly, that's partly what I'm doing now as a
> solution to a different problem, but it's really the opposite of elegant: To
> have a handle on when a document was first seen on the 'node of interest' we
> have an update trigger write a sequence number (which is approximate to the
> one the db was at when the document was first seen there) into a
> 'first_seen' field. Doing the same for a 'last_modified' field would be
> worse, or even impossible, since writing to that field would trigger the
> update trigger again.
>
If I remember your problem correctly, you want to process documents when
they first cross the transom into your cloud. If they are then replicated
out and about and come back again later, you don't need to reprocess them.

Of course anything involving client-changeable data is not a 100%
guarantee, but if you can live with occasionally reprocessing documents,
then you might try something like:

A view of all docs which do not have the cloud_processed property, and
then a process which is always trying to keep that view empty by
processing the docs it lists in whatever manner you need.

Of course you'll be relying on clients to trigger "download" (from the
cloud to their local) replication about as often as they trigger "upload"
replication; otherwise your process will start to stack up docs in a
conflict state.

The other solution I think we talked about was maintaining an independent
database in the cloud, which just tracks which document-ids have been
processed. This avoids the conflict scenario, and when you think about
what it means to the disks, it's about the same cost as maintaining that
view. However, you end up querying it over and over again for each
document you see, instead of just seeing the relevant docs.

I'd do whatever possible to avoid recording update_seq at the application
level, as CouchDB is not designed to make multi-node guarantees about that
property.

Chris

--
Chris Anderson
http://jchrisa.net
http://couch.io
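
For illustration, the "view of all docs which do not have the cloud_processed property" might look roughly like the sketch below. Only the cloud_processed name comes from the message; mapUnprocessed, processPending and handle are hypothetical names, and the emit() defined here is just a stand-in for the one CouchDB's view server supplies, so that the sketch runs outside CouchDB. A real worker would query the view over HTTP and save each document back rather than mutate objects in memory.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
// Assumption: only needed to run this sketch outside CouchDB.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function for the view: list every document that does not yet
// carry the cloud_processed property.
function mapUnprocessed(doc) {
  if (doc.cloud_processed === undefined) {
    emit(doc._id, null);
  }
}

// Hypothetical worker pass (in memory only): build the view rows,
// handle each listed doc, then mark it so it drops out of the view.
function processPending(docs, handle) {
  rows.length = 0; // rebuild the view from scratch for this pass
  docs.forEach(mapUnprocessed);
  const pending = rows.map(function (row) { return row.key; });
  docs.forEach(function (doc) {
    if (pending.indexOf(doc._id) !== -1) {
      handle(doc);
      doc.cloud_processed = true;
    }
  });
  return pending;
}
```

Running processPending repeatedly is what "keeping the view empty" amounts to: a second pass over the same documents finds nothing left to do unless new, unmarked documents have arrived.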
