Zachary,

No worries, the rough outline I'd do here is something like:

1. Figure out some member structure in the _design document that will represent your data flow. For the moment I would do something extremely simple, as in:

Assume:
db_name = "base_db"

{
    "_id": "_design/base",
    "views": {
        "stage-1": {
            "map": "function(doc) ...",
            "reduce": "function(keys, values, rereduce) ..."
        }
    },
    "review": [
        {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."},
        {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."}
    ]
}

So the review member defines the stages in your data flow. I'm avoiding any forking or merging in this example in honor of the "make it work, make it not suck" development flow.

Now the basic algorithm would be something like:

For each array element in the "review" member, create a db something like:

base_db-stage-1 with a design document that contains a view built from the first element of the "review" array,
base_db-stage-2 with the second element, and so on (a rough sketch of this setup step follows below).
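
Very roughly, in Ruby, that setup step could look like this sketch (assuming the json gem, CouchDB on localhost:5984, and my own convention of a single "stage" view per stage db):

require 'net/http'
require 'json'

http = Net::HTTP.new("localhost", 5984)

# Pull the flow definition out of the base design document.
ddoc = JSON.parse(http.get("/base_db/_design/base").body)

ddoc["review"].each_with_index do |stage, i|
  stage_db = "base_db-stage-#{i + 1}"

  # Create the stage db, then give it a design doc holding just
  # this stage's map/reduce pair.
  http.put("/#{stage_db}", "")
  http.put("/#{stage_db}/_design/base",
           JSON.generate("views" => {"stage" => stage}),
           "Content-Type" => "application/json")
end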

Then your script can check the view status in each database, either from cron or from an update_notifier. To do so, you can just:

HEAD /base_db/_design/base/_view/stage-1

And then check the returned ETag. For the moment this is exactly equivalent to checking the database's update_seq because of how the ETag is calculated, but in the future, when we track the last update_seq for each view change, this will be a free upgrade. Plus it's a bit more logical to check "view state" instead of "db state".
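
For example (where the last seen ETag gets stored, a file here, is just a placeholder):

require 'net/http'

http = Net::HTTP.new("localhost", 5984)

res  = http.head("/base_db/_design/base/_view/stage-1")
etag = res["ETag"]

state_file = "stage-1.etag"
last = File.exist?(state_file) ? File.read(state_file) : nil

if etag != last
  # The view output changed since we last looked:
  # rebuild the downstream stage database here.
  File.open(state_file, "w") { |f| f << etag }
end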

When the ETags don't match, you can just drop the next db in the flow, create it, and then copy the view output. The drop/create just makes the algorithm easy to implement for now. In the future there can be some extra logic to update the new view only as far as it needs to, by iterating over the two views and doing a merge-sort-ish type of thing. I think... sounds like there should be a way, at least.
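
A sketch of that step, with the same illustrative names as above; note that _bulk_docs assigns fresh _ids to docs posted without one, which dodges the docid question for the moment:

require 'net/http'
require 'json'

http = Net::HTTP.new("localhost", 5984)

dest = "base_db-stage-1"

# Drop and recreate the downstream db. (Re-PUT its design doc here
# too, as in the setup sketch, so the next stage's view exists.)
http.delete("/#{dest}")
http.put("/#{dest}", "")

# Copy the grouped view output over as plain documents.
view = http.get("/base_db/_design/base/_view/stage-1?group=true")
docs = JSON.parse(view.body)["rows"].map do |row|
  {"key" => row["key"], "value" => row["value"]}
end

http.post("/#{dest}/_bulk_docs",
          JSON.generate("docs" => docs),
          "Content-Type" => "application/json")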

Once that works we can look at bolting on fancy things like forking map/reduce mechanisms, and my current pet idea of adding in the merge stuff that has been talked about.

This is actually starting to sound like a fun little project....

HTH,
Paul Davis

Zachary Zolton wrote:
paul

alright... you've gotta give me the remedial explanation of what you
meant here! (sorry, i'm still noob-ish)

so, are you saying that i shouldn't even check for individual doc
updates, but instead just recreate the entire database? that sounds
like a job for cron, more so than the update notifier, right?

i'd put up my current ruby script, but it deals with update notifications
in a way that's very specific to my data, probably very naïvely to boot!

zach

On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
<[email protected]> wrote:
Zachary,

Awesome. The thing with non-incremental updates is that the basic algorithm
would be to just look for updates to the view and on update, delete the
review DB, create a new one, and then dump the new data into it. I wouldn't
try too hard to optimize updates at this point in time.

Getting a ruby script out to show the basics should probably be the first
step. Beyond that we'll have to take it a step at a time.

HTH,
Paul Davis


Zachary Zolton wrote:
@jchris et al,

if you had any pointers on how to implement this, i have a strong
motivation to try my hand at it.

i have a janky ruby script running as an update notifier that looks
for certain criteria, idiomatic to my data, and puts docs into a
derived database. but i'm not terribly happy with my current
implementation...

is there a general-purpose algorithm for dealing with updates?


cheers,

zach


On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]> wrote:

Sent from my iPhone

On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:


Hi Adam,

On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:


Hi Wout, thanks for writing this up.

One comment about the map-only views: I think you'll find that Couch has already done a good bit of the work needed to support them, too. Couch maintains a btree for each design doc, keyed on docid, that stores all the view keys emitted by the maps over each document. When a document is updated and then analyzed, Couch has to consult that btree, purge all the KVs associated with the old version of the doc from each view, and then insert the new KVs. So the tracking information correlating docids and view keys is already available.
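
A toy model of that bookkeeping in Ruby, with in-memory hashes standing in for Couch's btrees (and simplified to one value per key):

view_index = {}   # view key => emitted value
back_index = {}   # docid    => keys the doc last emitted

def update_doc(view_index, back_index, docid, new_kvs)
  # Purge the KVs from the old version of the doc...
  (back_index[docid] || []).each { |key| view_index.delete(key) }
  # ...then insert the new KVs and remember them for next time.
  new_kvs.each { |key, value| view_index[key] = value }
  back_index[docid] = new_kvs.keys
end

update_doc(view_index, back_index, "doc-1", {"a" => 1, "b" => 2})
update_doc(view_index, back_index, "doc-1", {"b" => 3})
# view_index is now {"b" => 3}; the stale "a" => 1 was purged.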

See I did not know that :-) Although I should have guessed.

However, in the mail before this one I argued that it doesn't make sense to combine or chain map-only views, since you can always write a map function that does it in one step. Do you agree?

You might also know the answer to this: is it possible to make the Review DB be a sort of view index on the current database? All it needs are JSON keys and values, no other fields.


You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.

Well, since the Review DB is a local(*) hidden database that's handled a bit specially, I think the easiest is to assign _id a sequence number and create a default view that indexes the documents by doc.key (for updating the value for that key). There will never be contention and we're only interested in the key index.

We discussed this a little at CouchHack and I argued that the simplest
solution is actually good for a few reasons.

The simple solution: provide a mechanism to copy the rows of a grouped
reduce function to a new database.

Good because it is most like Hadoop/Google style map reduce. In that paradigm, the output of a map/reduce job is not incremental, and it is persisted in a way that allows for multiple later reduce stages to be run on it. It's common in Hadoop to chain many m/r stages, and to try a few iterations of each stage while developing code.

I like this also because it provides the needed functionality without adding any new primitives to CouchDB.

The only downside of this approach is that it is not incremental. I'm not sure that incremental chainability has much promise, as the index management could be a pain, especially if you have branching chains.

Another upside is that by reducing to a db, you give the user power to do things like use replication to merge multiple data sets before applying more views.

I don't want to discourage anyone from experimenting with code, just want to point out this simple solution which would be Very Easy to implement.


(*) local: I'm assuming that views are not replicated and need to be recalculated for each CouchDB node. If they are replicated somehow, I think it would still work but we'd have to look at it a little more.


With that said, I'd prefer to spend my time extending the view engine to handle chainable MR workflows in a single shot. Especially in the simple sort_by_value case it just seems like a cleaner way to go about things.

Yes, that seems to be the gist of all repliers and I agree :-)

In a nutshell, I'm hoping that:
* A review is a new sort of view that has an "inputs" array in its definition.
* Only MR views are allowed as inputs, no KV duplication allowed.
* It builds a persistent index of the incoming views when those get updated.
* That index is then used to build the view index for the review when the review gets updated.
* I think I covered the most important algorithms needed to implement this in my original proposal.
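
For concreteness, such a definition might look something like this (all names hypothetical, just to show the shape):

{
    "_id": "_design/word_totals",
    "review": {
        "inputs": ["base/stage-1", "base/stage-2"],
        "map": "function(key, value) ...",
        "reduce": "function(keys, values, rereduce) ..."
    }
}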

Does this sound feasible? If so I'll update my proposal accordingly.

Wout.

