Re: Proposal: Review DBs

Chris Anderson Sun, 26 Apr 2009 21:20:42 -0700



On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:

Hi Adam,

On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
Hi Wout, thanks for writing this up.
One comment about the map-only views: I think you'll find that Couch has already done a good bit of the work needed to support them, too. Couch maintains a btree for each design doc keyed on docid that stores all the view keys emitted by the maps over each document. When a document is updated and then analyzed, Couch has to consult that btree, purge all the KVs associated with the old version of the doc from each view, and then insert the new KVs. So the tracking information correlating docids and view keys is already available.
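To make the bookkeeping described above concrete, here is a minimal in-memory sketch. It is not CouchDB's actual implementation: the class name `ViewIndex` and its fields are hypothetical, plain dicts stand in for the btrees, and view keys are assumed to be hashable (strings or tuples rather than JSON arrays).

```python
from collections import defaultdict


class ViewIndex:
    """Toy model of one view plus a by-docid back-index (hypothetical names)."""

    def __init__(self, map_fn):
        self.map_fn = map_fn             # doc -> iterable of (key, value) pairs
        self.rows = defaultdict(list)    # view key -> list of (docid, value)
        self.keys_by_docid = {}          # docid -> keys emitted by its last version

    def update(self, docid, doc):
        """Re-index one document; pass doc=None to model a deletion."""
        # Step 1: consult the back-index and purge the KVs of the old version.
        for key in self.keys_by_docid.pop(docid, []):
            remaining = [(d, v) for d, v in self.rows[key] if d != docid]
            if remaining:
                self.rows[key] = remaining
            else:
                del self.rows[key]
        if doc is None:
            return
        # Step 2: run the map function and insert the new KVs.
        emitted = list(self.map_fn(doc))
        for key, value in emitted:
            self.rows[key].append((docid, value))
        # Step 3: remember which keys this docid emitted, for the next update.
        self.keys_by_docid[docid] = [key for key, _ in emitted]
```

The point of the sketch is only that `keys_by_docid` already correlates docids with view keys, which is the tracking information the paragraph refers to.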
See I did not know that :-) Although I should have guessed.
However, in the mail before this one I argued that it doesn't make sense to combine or chain map-only views since you can always write a map function that does it in one step. Do you agree?
You might also know the answer to this: is it possible to make the Review DB be a sort of view index on the current database? All it needs are JSON keys and values, no other fields.
You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.
Well, since the Review DB is a local(*) hidden database that's handled a bit specially, I think the easiest is to assign _id a sequence number and create a default view that indexes the documents by doc.key (for updating the value for that key). There will never be contention and we're only interested in the key index.
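A rough sketch of that idea, under stated assumptions: `ReviewDB`, `put` and `by_key` are hypothetical names, everything lives in memory, and keys are assumed to be hashable. The `_id` is just a counter rendered as a string, and the "default view" is modelled as a dict from `doc.key` to `_id`.

```python
import itertools


class ReviewDB:
    """Toy model of a hidden Review DB with sequence-number _ids (hypothetical)."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.docs = {}     # _id -> {"_id": ..., "key": ..., "value": ...}
        self.by_key = {}   # doc.key -> _id, standing in for the default view

    def put(self, key, value):
        """Insert a key/value pair, or update the value if the key exists."""
        _id = self.by_key.get(key)
        if _id is None:
            # New key: assign the next sequence number as the _id.
            _id = str(next(self._seq))
            self.by_key[key] = _id
        self.docs[_id] = {"_id": _id, "key": key, "value": value}
        return _id
```

Because only the view machinery writes to this database, a lookup through `by_key` followed by an overwrite is enough; the sketch does not model revisions or conflicts.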

We discussed this a little at CouchHack and I argued that the simplest solution is actually good for a few reasons.

The simple solution: provide a mechanism to copy the rows of a grouped reduce function to a new database.
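As a sketch of what such a copy step could look like from the client side: a view queried with `group=true` returns rows of the form `{"key": ..., "value": ...}`, and `_bulk_docs` accepts a body of the form `{"docs": [...]}`. The helper names below are hypothetical, the HTTP calls are left out, and deriving `_id` from the JSON-encoded key is my assumption, not something settled in this thread.

```python
import json


def rows_to_docs(rows):
    """Turn grouped reduce rows into documents for a target database."""
    docs = []
    for row in rows:
        docs.append({
            # Assumption: the JSON-encoded key is unique per grouped row,
            # so it can serve as a stable _id in the target database.
            "_id": json.dumps(row["key"], separators=(",", ":")),
            "key": row["key"],
            "value": row["value"],
        })
    return docs


def bulk_docs_payload(rows):
    """Build the JSON body one would POST to the target db's _bulk_docs."""
    return json.dumps({"docs": rows_to_docs(rows)})
```

Running this again after the source view changes would need the current `_rev` of each target document, or a fresh target database; that matches the non-incremental nature of the approach.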

Good because it is most like Hadoop/Google style map reduce. In that paradigm, the output of a map/reduce job is not incremental, and it is persisted in a way that allows for multiple later reduce stages to be run on it. It's common in Hadoop to chain many m/r stages, and to try a few iterations of each stage while developing code.

I like this also because it provides the needed functionality without adding any new primitives to CouchDB.

The only downside of this approach is that it is not incremental. I'm not sure that incremental chainability has much promise, as the index management could be a pain, especially if you have branching chains.

Another upside is that by reducing to a db, you give the user power to do things like use replication to merge multiple data sets before applying more views.

I don't want to discourage anyone from experimenting with code, just want to point out this simple solution which would be Very Easy to implement.

(*)local: I'm assuming that views are not replicated and need to be recalculated for each CouchDB node. If they are replicated somehow, I think it would still work but we'd have to look at it a little more.
With that said, I'd prefer to spend my time extending the view engine to handle chainable MR workflows in a single shot. Especially in the simple sort_by_value case it just seems like a cleaner way to go about things.
Yes, that seems to be the gist of all repliers and I agree :-)

In a nutshell, I'm hoping that:
* A review is a new sort of view that has an "inputs" array in its definition.
* Only MR views are allowed as inputs, no KV duplication allowed.
* It builds a persistent index of the incoming views when those get updated.
* That index is then used to build the view index for the review when the review gets updated.
* I think I covered the most important algorithms needed to implement this in my original proposal.
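For illustration only, a review definition along these lines might look like the design doc below, with a small check for the "only MR views as inputs" rule. The "reviews" field, the view names and `validate_review` are hypothetical; only the "inputs" array comes from the list above. The check does not cover the no-duplicate-keys rule, which would need the actual view rows.

```python
DESIGN_DOC = {
    "_id": "_design/example",
    "views": {
        "count_by_tag": {
            "map": "function(doc) { emit(doc.tag, 1); }",
            "reduce": "function(keys, values) { return sum(values); }",
        },
        "all_tags": {  # map-only view, not allowed as a review input
            "map": "function(doc) { emit(doc.tag, null); }",
        },
    },
    "reviews": {  # hypothetical field name
        "sorted_by_count": {
            "inputs": ["count_by_tag"],
            "map": "function(row) { emit(row.value, row.key); }",
        },
    },
}


def validate_review(design_doc, review_name):
    """Check that every input of a review names an existing map/reduce view."""
    views = design_doc.get("views", {})
    review = design_doc["reviews"][review_name]
    for name in review["inputs"]:
        view = views.get(name)
        if view is None:
            raise ValueError("unknown input view: " + name)
        if "reduce" not in view:
            raise ValueError("input view is map-only: " + name)
    return True
```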
Does this sound feasible? If so I'll update my proposal accordingly.

Wout.
