@jchris et al, if you have any pointers on how to implement this, I have a strong motivation to try my hand at it.
I have a janky ruby script running as an update notifier that looks for certain criteria, idiomatic to my data, and puts matching docs into a derived database. But I'm not terribly happy with my current implementation... Is there a general-purpose algorithm for dealing with updates?

cheers, zach

On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]> wrote:
>
>
> Sent from my iPhone
>
> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:
>
>> Hi Adam,
>>
>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>
>>> Hi Wout, thanks for writing this up.
>>>
>>> One comment about the map-only views: I think you'll find that Couch has already done a good bit of the work needed to support them, too. Couch maintains a btree for each design doc keyed on docid that stores all the view keys emitted by the maps over each document. When a document is updated and then analyzed, Couch has to consult that btree, purge all the KVs associated with the old version of the doc from each view, and then insert the new KVs. So the tracking information correlating docids and view keys is already available.
>>
>> See, I did not know that :-) Although I should have guessed.
>>
>> However, in the mail before this one I argued that it doesn't make sense to combine or chain map-only views, since you can always write a map function that does it in one step. Do you agree?
>>
>> You might also know the answer to this: is it possible to make the Review DB be a sort of view index on the current database? All it needs are JSON keys and values, no other fields.
>>
>>> You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.
>>
>> Well, since the Review DB is a local(*) hidden database that's handled a bit specially, I think the easiest is to assign _id a sequence number and create a default view that indexes the documents by doc.key (for updating the value for that key). There will never be contention and we're only interested in the key index.
>
> We discussed this a little at CouchHack and I argued that the simplest solution is actually good for a few reasons.
>
> The simple solution: provide a mechanism to copy the rows of a grouped reduce function to a new database.
>
> Good because it is most like Hadoop/Google style map reduce. In that paradigm, the output of a map/reduce job is not incremental, and it is persisted in a way that allows for multiple later reduce stages to be run on it. It's common in Hadoop to chain many m/r stages, and to try a few iterations of each stage while developing code.
>
> I like this also because it provides the needed functionality without adding any new primitives to CouchDB.
>
> The only downside of this approach is that it is not incremental. I'm not sure that incremental chainability has much promise, as the index management could be a pain, especially if you have branching chains.
>
> Another upside is that by reducing to a db, you give the user power to do things like use replication to merge multiple data sets before applying more views.
>
> I don't want to discourage anyone from experimenting with code, just want to point out this simple solution, which would be Very Easy to implement.
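Chris: to make sure I'm reading the "simple solution" right, here is roughly what I picture it looking like from the outside, driven by hand over HTTP for now. All the names (a "logs" source db, a "stats" design doc with a "by_user" view, a "stats_rollup" target db) are made up, and this is an untested sketch rather than a patch:

  require 'net/http'
  require 'json'
  require 'uri'

  couch = 'http://127.0.0.1:5984'

  # 1. Pull the grouped reduce rows from the source view.
  view_uri = URI("#{couch}/logs/_design/stats/_view/by_user?group=true")
  rows = JSON.parse(Net::HTTP.get(view_uri))['rows']

  # 2. Turn each {key, value} row into a plain doc. Using the key as the
  #    _id keeps one doc per key, but a re-run would need the existing
  #    _revs to update rather than conflict -- glossed over here.
  docs = rows.map do |row|
    { '_id' => row['key'].to_s, 'key' => row['key'], 'value' => row['value'] }
  end

  # 3. Bulk-load the docs into the target db, which can then carry its
  #    own design docs for the next map/reduce stage.
  target = URI("#{couch}/stats_rollup/_bulk_docs")
  post = Net::HTTP::Post.new(target.path, 'Content-Type' => 'application/json')
  post.body = JSON.generate('docs' => docs)
  Net::HTTP.start(target.host, target.port) { |http| http.request(post) }

If that's the idea, the built-in mechanism would essentially do steps 1-3 server-side, and the target db is then free to carry its own design docs for further map/reduce passes, or be replicated and merged first as you say.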
>
>>
>> (*)local: I'm assuming that views are not replicated and need to be recalculated for each CouchDB node. If they are replicated somehow, I think it would still work but we'd have to look at it a little more.
>>
>>> With that said, I'd prefer to spend my time extending the view engine to handle chainable MR workflows in a single shot. Especially in the simple sort_by_value case it just seems like a cleaner way to go about things.
>>
>> Yes, that seems to be the gist of all repliers and I agree :-)
>>
>> In a nutshell, I'm hoping that:
>> * A review is a new sort of view that has an "inputs" array in its definition.
>> * Only MR views are allowed as inputs, no KV duplication allowed.
>> * It builds a persistent index of the incoming views when those get updated.
>> * That index is then used to build the view index for the review when the review gets updated.
>> * I think I covered the most important algorithms needed to implement this in my original proposal.
>>
>> Does this sound feasible? If so I'll update my proposal accordingly.
>>
>> Wout.
>
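And Wout, just to check that I follow the nutshell above: would the review definition end up living in a design doc, something like the hash below (written as what my ruby script would PUT)? Every field name here is my guess at a shape, not something taken from your proposal:

  review_ddoc = {
    '_id'     => '_design/tag_stats',
    'reviews' => {
      'by_count' => {
        # guessed field: the MR views whose grouped rows feed this review
        'inputs' => ['analytics/tag_counts'],
        # guessed fields: map/reduce applied to the {key, value} rows of the inputs,
        # here flipping key and value for the sort_by_value case
        'map'    => 'function(key, value) { emit(value, key); }',
        'reduce' => '_count'
      }
    }
  }

If that's roughly the shape, I'd be glad to start by prototyping the persistent index of the input views and report back.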
