On Apr 27, 2009, at 5:20 AM, Chris Anderson wrote:
> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:
>
>>> You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.
>>
>> Well, since the Review DB is a local(*) hidden database that's handled a bit specially, I think the easiest is to assign _id a sequence number and create a default view that indexes the documents by doc.key (for updating the value for that key). There will never be contention and we're only interested in the key index.
>
> We discussed this a little at CouchHack and I argued that the simplest solution is actually good for a few reasons.
>
> The simple solution: provide a mechanism to copy the rows of a grouped reduce function to a new database.
Ok, the problems I see with that though are:

- How to assign _ids to the rows
- Separate design doc needed for each DB
- Spreads application logic
- All data not available from one parent URL a la CouchApps
- Namespace pollution, all these utility DBs

Or do you mean this as a single-shot data dump? Couldn't that get quite expensive, storage-wise?
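For concreteness, here is a rough sketch of what "copy the rows of a grouped reduce to a new database" could look like. The row shape ({key, value}) matches CouchDB view output; everything else (the helper name, the _id scheme) is illustrative, not an existing API:

```javascript
// Hypothetical sketch: turn the rows of a group=true reduce query
// into documents for a new "review" database.
function rowsToDocs(rows) {
  return rows.map(function (row) {
    return {
      // One option for the docid problem: derive _id from the
      // JSON-encoded group key, so re-copying updates in place
      // instead of creating duplicates.
      _id: JSON.stringify(row.key),
      key: row.key,
      value: row.value
    };
  });
}

// Example: grouped reduce output, e.g. tag counts.
var docs = rowsToDocs([
  { key: "couchdb", value: 12 },
  { key: "erlang",  value: 7 }
]);
// docs could then go to the target DB in one POST to _bulk_docs
```

Deriving _id from the group key is one answer to the first objection above, at the cost of tying docids to key encoding.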
> Good because it is most like Hadoop/Google style map reduce. In that paradigm, the output of a map/reduce job is not incremental, and it is persisted in a way that allows for multiple later reduce stages to be run on it. It's common in Hadoop to chain many m/r stages, and to try a few iterations of each stage while developing code.
Hmmm. If a hidden DB/view index is used, then the same function hashing techniques will work to decide which index to use for intermediate queries. I see no functional difference here.
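The chaining being discussed can be sketched as a second-stage map function running over the persisted output of the first stage. This is only an illustration of the idea, not CouchDB internals; the doc shape follows the copy-to-a-database scheme, and runMap is a made-up driver:

```javascript
// Stage 2's map function consumes stage 1's persisted output docs
// ({key, value}) rather than raw documents.
function stage2Map(doc, emit) {
  // e.g. re-key stage-1 totals into magnitude buckets
  emit(doc.value >= 10 ? "popular" : "rare", doc.key);
}

// Toy driver: apply a map function to a list of docs.
function runMap(docs, mapFn) {
  var out = [];
  docs.forEach(function (doc) {
    mapFn(doc, function (k, v) { out.push({ key: k, value: v }); });
  });
  return out;
}

var stage1Docs = [
  { key: "couchdb", value: 12 },
  { key: "erlang",  value: 7 }
];
var stage2Rows = runMap(stage1Docs, stage2Map);
// → [{key: "popular", value: "couchdb"}, {key: "rare", value: "erlang"}]
```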
> I like this also because it provides the needed functionality without adding any new primitives to CouchDB.
But how would that mechanism be used if there are no new primitives? If CouchDB allowed an extra "inputs" field on view definitions, that would be it as far as user-visible changes go in the current thinking for review DBs.
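To make the proposal concrete, a design doc might look like the sketch below. The "inputs" field is exactly the proposed extension under discussion, not shipping CouchDB syntax, and the db/view names are made up:

```javascript
// Sketch of the proposed user-visible change: a view that names
// another view as its input instead of mapping over raw documents.
var designDoc = {
  _id: "_design/stats",
  views: {
    by_tag: {
      map: "function(doc) { emit(doc.tag, 1); }",
      reduce: "_sum"
    },
    tag_histogram: {
      // proposed: consume the grouped rows of by_tag
      inputs: ["stats/by_tag"],
      map: "function(row) { emit(row.value, row.key); }"
    }
  }
};
```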
> The only downside of this approach is that it is not incremental. I'm not sure that incremental chainability has much promise, as the index management could be a pain, especially if you have branching chains.
Hmmm, I think that I showed that it needn't be. Any update to a view would trigger review index updates for all the views that have that view as input. Subsequent updates of those views then get propagated onwards in the same fashion. Nothing painful...
If you want the latest info, first update the input views and then the review view.
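The update ordering described above amounts to a depth-first refresh of the dependency graph; the seen set handles branching chains so a shared input is only updated once. The function and the updateIndex callback are illustrative, not CouchDB code:

```javascript
// Refresh a view's input views first (recursively), then the view
// itself. "inputs" maps each view name to the views it consumes.
function refresh(view, inputs, updateIndex, seen) {
  seen = seen || {};
  if (seen[view]) return;      // guard against branching/shared chains
  seen[view] = true;
  (inputs[view] || []).forEach(function (dep) {
    refresh(dep, inputs, updateIndex, seen);
  });
  updateIndex(view);           // all inputs are now current
}

// Example chain: by_tag -> tag_histogram
var inputs = { tag_histogram: ["by_tag"], by_tag: [] };
var order = [];
refresh("tag_histogram", inputs, function (v) { order.push(v); });
// → updates by_tag first, then tag_histogram
```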
> Another upside is that by reducing to a db, you give the user power to do things like use replication to merge multiple data sets before applying more views.
That's true... and I suppose it would be very useful in that case. Perhaps there's room for both approaches?
Wout.