Perhaps the background job could maintain some version number in the base DB's design doc, which clients could use to know which version of the derived database to hit. Something like the sketch below, say.
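(A rough Ruby sketch of that idea. The "current" field, the DB names, and the lack of error handling are all illustrative; only the CouchDB HTTP calls themselves are real.)

    require 'net/http'
    require 'json'

    HOST, PORT = '127.0.0.1', 5984

    def get_json(path)
      JSON.parse(Net::HTTP.get(HOST, path, PORT))
    end

    def put_json(path, doc)
      Net::HTTP.new(HOST, PORT).send_request(
        'PUT', path, JSON.generate(doc),
        'Content-Type' => 'application/json')
    end

    # Background job: after rebuilding a derived DB, record its name in
    # the base DB's design doc (the fetched doc carries the _rev that
    # the PUT back needs).
    ddoc = get_json('/base_db/_design/base')
    ddoc['current'] = 'base_db-stage-1-abc123'  # hypothetical name
    put_json('/base_db/_design/base', ddoc)

    # Client: read the pointer, then query whichever derived DB it names.
    current = get_json('/base_db/_design/base')['current']
    rows = get_json("/#{current}/_all_docs?include_docs=true")['rows']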
See? ;^) This might be simpler, as a first-class CouchDB feature, than a bolted-on script.

On Mon, Apr 27, 2009 at 2:48 PM, Paul Davis <[email protected]> wrote:
> Zachary,
>
> Hmm. Naming your derived databases as base_db-stage-etag doesn't sound
> like a bad idea. Though I dunno how you'd communicate to clients to
> start hitting the new versions, and it also doesn't tell the admins
> when to drop old indices.
>
> The only thing that comes to mind is to stick some intermediary in
> between clients and the actual derived data to make it transparent,
> and also to let you know when you can clean up old versions, etc.
>
> I'll keep thinking on it.
>
> Paul
>
> Zachary Zolton wrote:
>>
>> Okay, I'm starting to get ya. My question is: if I'm constantly
>> dropping/recreating/reindexing the derived database, how can I keep
>> serving requests from my website?
>>
>> One possible solution would be to time/etag/etc-stamp the derived DB
>> name, but that would seem to add a number of moving parts to my
>> system.
>>
>> Hmm... any ideas of how to pull a quick switcheroo on the backend of
>> my system, without too much hassle in the client code?
>>
>> On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <[email protected]> wrote:
>>>
>>> Zachary,
>>>
>>> No worries. The rough outline I'd do here is something like:
>>>
>>> 1. Figure out some member structure in the _design document that
>>> will represent your data flow. For the moment I would do something
>>> extremely simple, as in:
>>>
>>> Assume:
>>>     db_name = "base_db"
>>>
>>> {
>>>   "_id": "_design/base",
>>>   "views": {
>>>     "stage-1": {
>>>       "map": "function(doc) ...",
>>>       "reduce": "function(keys, values, rereduce) ..."
>>>     }
>>>   },
>>>   "review": [
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."},
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."}
>>>   ]
>>> }
>>>
>>> So the "review" member becomes the stages in your data flow. I'm
>>> avoiding any forking or merging in this example, in honor of the
>>> "make it work, make it not suck" development flow.
>>>
>>> Now the basic algorithm would be something like:
>>>
>>> For each array element in the "review" member, create a DB:
>>> base_db-stage-1 with a design document containing a view built from
>>> the first element of the "review" member, base_db-stage-2 with the
>>> second element, and so on.
>>>
>>> Then your script can check the view status in each database, either
>>> from cron or from an update_notifier. To do so, you can just:
>>>
>>>     HEAD /base_db/_design/base/_view/stage-1
>>>
>>> and then check the returned ETag. For the moment this is exactly
>>> equivalent to checking the database's update_seq, because of how the
>>> ETag is calculated, but in the future, when we track the last
>>> update_seq for each view change, this will be a free upgrade. Plus
>>> it's a bit more logical to check "view state" instead of "db state".
>>>
>>> When the ETags don't match, you can just drop the next DB in the
>>> flow, create it, and then copy the view output. The drop/create just
>>> makes the algorithm easily implementable for now. In the future
>>> there can be some extra logic to only change the new view as far as
>>> it requires, by iterating over the two views and doing a
>>> merge-sort-ish type of thing. I think... Sounds like there should be
>>> a way, at least.
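(A minimal Ruby sketch of the polling/copy loop Paul describes, using the design-doc layout from his example. The HEAD request, the ETag header, and the _view and _bulk_docs endpoints are real CouchDB API; the stage naming, the 60-second poll, and the absence of error handling are illustrative.)

    require 'net/http'
    require 'json'

    HOST, PORT = '127.0.0.1', 5984

    def http
      Net::HTTP.new(HOST, PORT)
    end

    def view_etag(db, ddoc, view)
      http.request_head("/#{db}/_design/#{ddoc}/_view/#{view}")['ETag']
    end

    last_etag = nil
    loop do
      etag = view_etag('base_db', 'base', 'stage-1')
      if etag != last_etag
        # Drop and recreate the next DB in the flow...
        http.delete('/base_db-stage-1')
        http.send_request('PUT', '/base_db-stage-1')
        # ...then copy the grouped view rows into it as plain documents
        # (no _id given, so CouchDB assigns one per doc).
        body = Net::HTTP.get(HOST,
          '/base_db/_design/base/_view/stage-1?group=true', PORT)
        docs = JSON.parse(body)['rows'].map do |r|
          {'key' => r['key'], 'value' => r['value']}
        end
        http.send_request('POST', '/base_db-stage-1/_bulk_docs',
                          JSON.generate('docs' => docs),
                          'Content-Type' => 'application/json')
        last_etag = etag
      end
      sleep 60
    end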
>>>
>>> Once that works, we can look at bolting on different fancy things,
>>> like having forking map/reduce mechanisms, and my current pet idea
>>> of adding in the merge stuff that has been talked about.
>>>
>>> This is actually starting to sound like a fun little project...
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> Zachary Zolton wrote:
>>>>
>>>> Paul,
>>>>
>>>> Alright... you've gotta give me the remedial explanation of what
>>>> you meant here! (Sorry, I'm still noob-ish.)
>>>>
>>>> So, are you saying that I shouldn't even check for individual doc
>>>> updates, but instead just recreate the entire database? That sounds
>>>> like a job for cron, more so than the update notifier, right?
>>>>
>>>> I'd put up my current Ruby script, but it deals with update
>>>> notifications in a way that's very specific to my data (probably
>>>> very naïvely, to boot!).
>>>>
>>>> Zach
>>>>
>>>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
>>>> <[email protected]> wrote:
>>>>>
>>>>> Zachary,
>>>>>
>>>>> Awesome. The thing with non-incremental updates is that the basic
>>>>> algorithm would be to just look for updates to the view and, on
>>>>> update, delete the review DB, create a new one, and then dump the
>>>>> new data into it. I wouldn't try too hard to optimize updates at
>>>>> this point in time.
>>>>>
>>>>> Getting a Ruby script out to show the basics should probably be
>>>>> the first step. Beyond that we'll have to take it a step at a
>>>>> time.
>>>>>
>>>>> HTH,
>>>>> Paul Davis
>>>>>
>>>>> Zachary Zolton wrote:
>>>>>>
>>>>>> @jchris et al,
>>>>>>
>>>>>> If you had any pointers on how to implement this, I have a strong
>>>>>> motivation to try my hand at it.
>>>>>>
>>>>>> I have a janky Ruby script running as an update notifier that
>>>>>> looks for certain criteria, idiomatic to my data, and puts docs
>>>>>> into a derived database. But I'm not terribly happy with my
>>>>>> current implementation...
>>>>>>
>>>>>> Is there a general-purpose algorithm for dealing with updates?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Zach
>>>>>>
>>>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>>>>
>>>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>>>
>>>>>>>>> One comment about the map-only views: I think you'll find that
>>>>>>>>> Couch has already done a good bit of the work needed to
>>>>>>>>> support them, too. Couch maintains a btree for each design
>>>>>>>>> doc, keyed on docid, that stores all the view keys emitted by
>>>>>>>>> the maps over each document. When a document is updated and
>>>>>>>>> then analyzed, Couch has to consult that btree, purge all the
>>>>>>>>> KVs associated with the old version of the doc from each view,
>>>>>>>>> and then insert the new KVs. So the tracking information
>>>>>>>>> correlating docids and view keys is already available.
>>>>>>>>
>>>>>>>> See, I did not know that :-) Although I should have guessed.
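(Adam's description of that back-index is easy to picture with a toy model. This is not CouchDB's actual implementation, which is an Erlang btree; it's just two Ruby hashes showing why purging a doc's old KVs on update is cheap.)

    # Toy model of the docid -> emitted-keys back-index Adam describes.
    class ToyView
      def initialize(&map_fn)
        @map_fn = map_fn
        @kvs = {}        # [key, docid] => value  (the view index)
        @by_docid = {}   # docid => emitted keys  (the back-index)
      end

      def update(doc)
        id = doc['_id']
        # Purge all KVs the old version of this doc emitted...
        (@by_docid[id] || []).each { |k| @kvs.delete([k, id]) }
        # ...then re-map the new version and insert its KVs.
        emitted = @map_fn.call(doc)   # => [[key, value], ...]
        emitted.each { |k, v| @kvs[[k, id]] = v }
        @by_docid[id] = emitted.map(&:first)
      end

      def rows
        @kvs.sort.map { |(k, id), v| {'id' => id, 'key' => k, 'value' => v} }
      end
    end

    view = ToyView.new { |doc| [[doc['type'], 1]] }
    view.update('_id' => 'a', 'type' => 'post')
    view.update('_id' => 'a', 'type' => 'comment')  # old 'post' KV purged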
>>>>>>>>
>>>>>>>> However, in the mail before this one I argued that it doesn't
>>>>>>>> make sense to combine or chain map-only views, since you can
>>>>>>>> always write a map function that does it in one step. Do you
>>>>>>>> agree?
>>>>>>>>
>>>>>>>> You might also know the answer to this: is it possible to make
>>>>>>>> the Review DB a sort of view index on the current database? All
>>>>>>>> it needs are JSON keys and values, no other fields.
>>>>>>>>
>>>>>>>>> You'd still be left with the problem of generating unique
>>>>>>>>> docids for the documents in the Review DB, but I think that's
>>>>>>>>> a problem that needs to be solved. The restriction to only MR
>>>>>>>>> views with no duplicate keys across views seems too strong to
>>>>>>>>> me.
>>>>>>>>
>>>>>>>> Well, since the Review DB is a local(*) hidden database that's
>>>>>>>> handled a bit specially, I think the easiest is to assign _id a
>>>>>>>> sequence number and create a default view that indexes the
>>>>>>>> documents by doc.key (for updating the value for that key).
>>>>>>>> There will never be contention, and we're only interested in
>>>>>>>> the key index.
>>>>>>>
>>>>>>> We discussed this a little at CouchHack, and I argued that the
>>>>>>> simplest solution is actually good for a few reasons.
>>>>>>>
>>>>>>> The simple solution: provide a mechanism to copy the rows of a
>>>>>>> grouped reduce function to a new database.
>>>>>>>
>>>>>>> Good because it is most like Hadoop/Google-style map/reduce. In
>>>>>>> that paradigm, the output of a map/reduce job is not
>>>>>>> incremental, and it is persisted in a way that allows multiple
>>>>>>> later reduce stages to be run on it. It's common in Hadoop to
>>>>>>> chain many m/r stages, and to try a few iterations of each stage
>>>>>>> while developing code.
>>>>>>>
>>>>>>> I like this also because it provides the needed functionality
>>>>>>> without adding any new primitives to CouchDB.
>>>>>>>
>>>>>>> The only downside of this approach is that it is not
>>>>>>> incremental. I'm not sure that incremental chainability has much
>>>>>>> promise, as the index management could be a pain, especially if
>>>>>>> you have branching chains.
>>>>>>>
>>>>>>> Another upside is that by reducing to a db, you give the user
>>>>>>> the power to do things like use replication to merge multiple
>>>>>>> data sets before applying more views.
>>>>>>>
>>>>>>> I don't want to discourage anyone from experimenting with code;
>>>>>>> I just want to point out this simple solution, which would be
>>>>>>> Very Easy to implement.
>>>>>>>
>>>>>>>> (*) local: I'm assuming that views are not replicated and need
>>>>>>>> to be recalculated for each CouchDB node. If they are
>>>>>>>> replicated somehow, I think it would still work, but we'd have
>>>>>>>> to look at it a little more.
>>>>>>>>
>>>>>>>>> With that said, I'd prefer to spend my time extending the view
>>>>>>>>> engine to handle chainable MR workflows in a single shot.
>>>>>>>>> Especially in the simple sort_by_value case, it just seems
>>>>>>>>> like a cleaner way to go about things.
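(A small Ruby sketch of Wout's sequence-id scheme, assuming Review DB docs carry only key/value fields. The zero-padded _id format, the db name, and the "by_key" view name are all made up for illustration; the design doc shape and _bulk_docs loading are standard CouchDB.)

    require 'json'

    seq = 0
    rows = [['apple', 3], ['banana', 7]]  # e.g. grouped reduce output

    docs = rows.map do |key, value|
      seq += 1
      {'_id' => format('%010d', seq), 'key' => key, 'value' => value}
    end

    # The "default view" Wout mentions: index Review DB docs by doc.key
    # so the row for a given key can be found and updated.
    design = {
      '_id'   => '_design/review',
      'views' => {
        'by_key' => {'map' => 'function(doc) { emit(doc.key, doc.value); }'}
      }
    }

    payload = JSON.generate('docs' => docs + [design])
    # POST payload to /review_db/_bulk_docs to load the Review DB.
    puts payload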
>>>>>>>>
>>>>>>>> Yes, that seems to be the gist of all the replies, and I
>>>>>>>> agree :-)
>>>>>>>>
>>>>>>>> In a nutshell, I'm hoping that:
>>>>>>>> * A review is a new sort of view that has an "inputs" array in
>>>>>>>>   its definition.
>>>>>>>> * Only MR views are allowed as inputs; no KV duplication is
>>>>>>>>   allowed.
>>>>>>>> * It builds a persistent index of the incoming views when those
>>>>>>>>   get updated.
>>>>>>>> * That index is then used to build the view index for the
>>>>>>>>   review when the review gets updated.
>>>>>>>> * I think I covered the most important algorithms needed to
>>>>>>>>   implement this in my original proposal.
>>>>>>>>
>>>>>>>> Does this sound feasible? If so, I'll update my proposal
>>>>>>>> accordingly.
>>>>>>>>
>>>>>>>> Wout.
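(To make those bullet points concrete, a design doc for such a review might look like the following, mirroring the earlier design-doc example. This syntax is purely hypothetical, illustrating the proposal; no version of CouchDB supports an "inputs" member.)

    {
      "_id": "_design/sorted",
      "reviews": {
        "sort_by_value": {
          "inputs": ["base/stage-1"],
          "map": "function(key, value) { emit(value, key); }"
        }
      }
    }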
