okay, i'm starting to get ya. my question is, if i'm constantly dropping/recreating/reindexing the derived database, how can i keep serving requests from my website?

one possible solution would be to time/etag/etc-stamp the derived db name, but that would seem to add a number of moving parts to my system. hmm... any ideas of how to pull a quick switcheroo on the backend of my system, without too much hassle in the client code?
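(One low-hassle way to do the switcheroo is name indirection: build each derived db under a fresh timestamped name, then flip a small pointer doc that tells readers which db is current. A minimal sketch in Ruby; the "meta" db and "derived_pointer" doc are made-up names, not CouchDB features, and the copy step is elided:)

    # Sketch only: flip a pointer doc after the new derived db is built,
    # so readers never see a half-built database.
    require 'net/http'
    require 'json'

    def couch(req)
      Net::HTTP.start('localhost', 5984) { |http| http.request(req) }
    end

    # 1. Build the replacement under a timestamped name.
    new_db = "base_db-derived-#{Time.now.to_i}"
    couch(Net::HTTP::Put.new("/#{new_db}"))
    # ... copy the view output into new_db here ...

    # 2. Flip the pointer so the website starts reading the new db.
    #    The fetched doc keeps its _id/_rev, so the PUT back succeeds.
    pointer = JSON.parse(couch(Net::HTTP::Get.new('/meta/derived_pointer')).body)
    old_db = pointer['current']
    pointer['current'] = new_db
    put = Net::HTTP::Put.new('/meta/derived_pointer')
    put.body = JSON.generate(pointer)
    put['Content-Type'] = 'application/json'
    couch(put)

    # 3. Drop the stale db once any in-flight reads have finished.
    couch(Net::HTTP::Delete.new("/#{old_db}")) if old_db

(The client code only gains one extra GET, which can be cached for a few seconds; all the moving parts stay on the rebuild side.)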
On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <[email protected]> wrote:
> Zachary,
>
> No worries, the rough outline I'd do here is something like:
>
> 1. Figure out some member structure in the _design document that will
> represent your data flow. For the moment I would do something extremely
> simple as in:
>
> Assume:
> db_name = "base_db"
>
> {
>     "_id": "_design/base",
>     "views": {
>         "stage-1": {
>             "map": "function(doc) ...",
>             "reduce": "function(keys, values, rereduce) ..."
>         }
>     },
>     "review": [
>         {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."},
>         {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."}
>     ]
> }
>
> So the review member becomes the stages in your data flow. I'm avoiding any
> forking or merging in this example in honor of the "make it work, make it
> not suck" development flow.
>
> Now the basic algorithm would be something like:
>
> For each array element in the "review" member, create a db, something like:
> base_db-stage-1 with a design document that contains a view with the first
> element of the "review" member, base_db-stage-2 with the second element,
> and so on.
>
> Then your script can check the view status in each database, either with a
> cron job or an update_notifier. To do so, you can just:
>
> HEAD /base_db/_design/base/_view/stage-1
>
> And then check the returned ETag. For the moment this is exactly equivalent
> to checking the database's update_seq because of how the ETag is calculated,
> but in the future, when we track the last update_seq for each view change,
> this will be a free upgrade. Plus there's a bit more logical-ness to
> checking "view state" instead of "db state".
>
> When the ETags don't match, you can just drop the next db in the flow,
> create it, and then copy the view output. The drop/create just makes the
> algorithm easily implementable for now. In the future there can be some
> extra logic to only change the new view as far as it requires, by iterating
> over the two views and doing a merge-sort-ish type of thing. I think...
> Sounds like there should be a way at least.
>
> Once that works we can look at bolting on different fancy things, like
> having forking map/reduce mechanisms and my current pet idea of adding in
> the merge stuff that has been talked about.
>
> This is actually starting to sound like a fun little project....
>
> HTH,
> Paul Davis
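(As a rough illustration of that outline in Ruby (a sketch, not Paul's actual script; the polling interval, stage names, and doc shape are assumptions), a cron-style check might look like:)

    # Poll the stage-1 view's ETag; when it changes, rebuild the next
    # db in the flow and copy the grouped view rows into it as docs.
    # No error handling or paging; names match Paul's example above.
    require 'net/http'
    require 'json'

    def couch(req)
      Net::HTTP.start('localhost', 5984) { |http| http.request(req) }
    end

    last_etag = nil
    loop do
      etag = couch(Net::HTTP::Head.new('/base_db/_design/base/_view/stage-1'))['ETag']
      if etag && etag != last_etag
        # Drop and recreate the next db in the flow.
        couch(Net::HTTP::Delete.new('/base_db-stage-2'))
        couch(Net::HTTP::Put.new('/base_db-stage-2'))

        # Copy the grouped reduce output; CouchDB assigns fresh _ids
        # (see the docid discussion further down the thread).
        res = couch(Net::HTTP::Get.new('/base_db/_design/base/_view/stage-1?group=true'))
        docs = JSON.parse(res.body)['rows'].map do |row|
          { 'key' => row['key'], 'value' => row['value'] }
        end
        post = Net::HTTP::Post.new('/base_db-stage-2/_bulk_docs')
        post.body = JSON.generate('docs' => docs)
        post['Content-Type'] = 'application/json'
        couch(post)

        last_etag = etag
      end
      sleep 60
    end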
> Zachary Zolton wrote:
>> paul
>>
>> alright... you've gotta give me the remedial explanation of what you
>> meant here! (sorry, i'm still noob-ish)
>>
>> so, are you saying that i shouldn't even check for individual doc
>> updates, but instead just recreate the entire database? that sounds
>> like a job for cron, more so than the update notifier, right?
>>
>> i'd put up my current ruby script, but it deals with update notifications
>> in a way that's very specific to my data —probably very naïvely, to boot!
>>
>> zach
>>
>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis <[email protected]> wrote:
>>> Zachary,
>>>
>>> Awesome. The thing with non-incremental updates is that the basic
>>> algorithm would be to just look for updates to the view and, on update,
>>> delete the review DB, create a new one, and then dump the new data into
>>> it. I wouldn't try too hard to optimize updates at this point in time.
>>>
>>> Getting a ruby script out to show the basics should probably be the first
>>> step. Beyond that we'll have to take it a step at a time.
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> Zachary Zolton wrote:
>>>> @jchris et al,
>>>>
>>>> if you had any pointers on how to implement this, i have a strong
>>>> motivation to try my hand at it.
>>>>
>>>> i have a janky ruby script running as an update notifier that looks
>>>> for certain criteria, idiomatic to my data, and puts docs into a
>>>> derived database. but i'm not terribly happy with my current
>>>> implementation...
>>>>
>>>> is there a general-purpose algorithm for dealing with updates?
>>>>
>>>> cheers,
>>>> zach
>>>>
>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]> wrote:
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:
>>>>>> Hi Adam,
>>>>>>
>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>
>>>>>>> One comment about the map-only views: I think you'll find that Couch
>>>>>>> has already done a good bit of the work needed to support them, too.
>>>>>>> Couch maintains a btree for each design doc, keyed on docid, that
>>>>>>> stores all the view keys emitted by the maps over each document. When
>>>>>>> a document is updated and then analyzed, Couch has to consult that
>>>>>>> btree, purge all the KVs associated with the old version of the doc
>>>>>>> from each view, and then insert the new KVs. So the tracking
>>>>>>> information correlating docids and view keys is already available.
>>>>>>
>>>>>> See, I did not know that :-) Although I should have guessed.
>>>>>>
>>>>>> However, in the mail before this one I argued that it doesn't make
>>>>>> sense to combine or chain map-only views, since you can always write a
>>>>>> map function that does it in one step. Do you agree?
>>>>>>
>>>>>> You might also know the answer to this: is it possible to make the
>>>>>> Review DB be a sort of view index on the current database? All it
>>>>>> needs are JSON keys and values, no other fields.
>>>>>>
>>>>>>> You'd still be left with the problem of generating unique docids for
>>>>>>> the documents in the Review DB, but I think that's a problem that
>>>>>>> needs to be solved. The restriction to only MR views with no
>>>>>>> duplicate keys across views seems too strong to me.
>>>>>>
>>>>>> Well, since the Review DB is a local(*) hidden database that's handled
>>>>>> a bit specially, I think the easiest is to assign _id a sequence
>>>>>> number and create a default view that indexes the documents by doc.key
>>>>>> (for updating the value for that key). There will never be contention
>>>>>> and we're only interested in the key index.
>>>>>
>>>>> We discussed this a little at CouchHack and I argued that the simplest
>>>>> solution is actually good for a few reasons.
>>>>>
>>>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>>>> reduce function to a new database.
>>>>>
>>>>> Good because it is most like Hadoop/Google-style map/reduce. In that
>>>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>>>> persisted in a way that allows for multiple later reduce stages to be
>>>>> run on it. It's common in Hadoop to chain many m/r stages, and to try a
>>>>> few iterations of each stage while developing code.
>>>>>
>>>>> I like this also because it provides the needed functionality without
>>>>> adding any new primitives to CouchDB.
>>>>>
>>>>> The only downside of this approach is that it is not incremental. I'm
>>>>> not sure that incremental chainability has much promise, as the index
>>>>> management could be a pain, especially if you have branching chains.
>>>>>
>>>>> Another upside is that by reducing to a db, you give the user power to
>>>>> do things like use replication to merge multiple data sets before
>>>>> applying more views.
>>>>>
>>>>> I don't want to discourage anyone from experimenting with code, just
>>>>> want to point out this simple solution, which would be Very Easy to
>>>>> implement.
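(To make the chaining concrete: once the grouped rows land in the new db as key/value docs, as in the Ruby sketch upthread, the next stage is just an ordinary design doc on that db. For instance, a sort-by-value stage could be nothing more than the following; the doc shape and names are a guess, not an agreed format:)

    {
        "_id": "_design/stage2",
        "views": {
            "by_value": {
                "map": "function(doc) { emit(doc.value, doc.key); }"
            }
        }
    }

(No new primitives needed: the copy step is plain HTTP, and replication can merge several such dbs before this view runs, which is the upside Chris mentions.)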
>>>>>> (*)local: I'm assuming that views are not replicated and need to be
>>>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>>>> think it would still work, but we'd have to look at it a little more.
>>>>>>
>>>>>>> With that said, I'd prefer to spend my time extending the view engine
>>>>>>> to handle chainable MR workflows in a single shot. Especially in the
>>>>>>> simple sort_by_value case it just seems like a cleaner way to go
>>>>>>> about things.
>>>>>>
>>>>>> Yes, that seems to be the gist of all the replies, and I agree :-)
>>>>>>
>>>>>> In a nutshell, I'm hoping that:
>>>>>> * A review is a new sort of view that has an "inputs" array in its
>>>>>> definition.
>>>>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>>>>> * It builds a persistent index of the incoming views when those get
>>>>>> updated.
>>>>>> * That index is then used to build the view index for the review when
>>>>>> the review gets updated.
>>>>>> * I think I covered the most important algorithms needed to implement
>>>>>> this in my original proposal.
>>>>>>
>>>>>> Does this sound feasible? If so, I'll update my proposal accordingly.
>>>>>>
>>>>>> Wout.
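(None of the following syntax exists; it is only a sketch of what Wout's bullet points might turn into as a design doc, with the "inputs" array naming the upstream MR views and every name here invented for illustration:)

    {
        "_id": "_design/report",
        "reviews": {
            "totals_by_value": {
                "inputs": ["base/stage-1"],
                "map": "function(key, value) { emit(value, key); }",
                "reduce": "function(keys, values, rereduce) ..."
            }
        }
    }

(The persistent index of the inputs would then play the role the copied db plays in Chris's scheme, but updated incrementally.)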
