okay, i'm starting to get ya. my question is, if i'm constantly dropping/recreating/reindexing the derived database, how can i keep serving requests from my website?

one possible solution would be to time/etag/etc-stamp the derived db name, but that would seem to add a number of moving parts to my system. hmm... any ideas of how to pull a quick switcheroo on the backend of my system, without too much hassle in the client code?
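(One low-hassle way to do the switcheroo is name indirection: build each derived db under a fresh timestamped name, then flip a small pointer doc that tells readers which db is current. A minimal sketch in Ruby; the "meta" db and "derived_pointer" doc are made-up names, not CouchDB features, and the copy step is elided:)

    # Sketch only: flip a pointer doc after the new derived db is built,
    # so readers never see a half-built database.
    require 'net/http'
    require 'json'

    def couch(req)
      Net::HTTP.start('localhost', 5984) { |http| http.request(req) }
    end

    # 1. Build the replacement under a timestamped name.
    new_db = "base_db-derived-#{Time.now.to_i}"
    couch(Net::HTTP::Put.new("/#{new_db}"))
    # ... copy the view output into new_db here ...

    # 2. Flip the pointer so the website starts reading the new db.
    #    The fetched doc keeps its _id/_rev, so the PUT back succeeds.
    pointer = JSON.parse(couch(Net::HTTP::Get.new('/meta/derived_pointer')).body)
    old_db = pointer['current']
    pointer['current'] = new_db
    put = Net::HTTP::Put.new('/meta/derived_pointer')
    put.body = JSON.generate(pointer)
    put['Content-Type'] = 'application/json'
    couch(put)

    # 3. Drop the stale db once any in-flight reads have finished.
    couch(Net::HTTP::Delete.new("/#{old_db}")) if old_db

(The client code only gains one extra GET, which can be cached for a few seconds; all the moving parts stay on the rebuild side.)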
On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <[email protected]> wrote:
> Zachary,
>
> No worries, the rough outline I'd do here is something like:
>
> 1. Figure out some member structure in the _design document that will
> represent your data flow. For the moment I would do something extremely
> simple as in:
>
> Assume:
> db_name = "base_db"
>
> {
>     "_id": "_design/base",
>     "views": {
>         "stage-1": {
>             "map": "function(doc) ...",
>             "reduce": "function(keys, values, rereduce) ..."
>         }
>     },
>     "review": [
>         {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."},
>         {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."}
>     ]
> }
>
> So the review member becomes the stages in your data flow. I'm avoiding any
> forking or merging in this example in honor of the "make it work, make it
> not suck" development flow.
>
> Now the basic algorithm would be something like:
>
> For each array element in the "review" member, create a db, something like:
> base_db-stage-1 with a design document that contains a view with the first
> element of the "review" member, base_db-stage-2 with the second element,
> and so on.
>
> Then your script can check the view status in each database, either with a
> cron job or an update_notifier. To do so, you can just:
>
> HEAD /base_db/_design/base/_view/stage-1
>
> And then check the returned ETag. For the moment this is exactly equivalent
> to checking the database's update_seq because of how the ETag is calculated,
> but in the future, when we track the last update_seq for each view change,
> this will be a free upgrade. Plus there's a bit more logical-ness to
> checking "view state" instead of "db state".
>
> When the ETags don't match, you can just drop the next db in the flow,
> create it, and then copy the view output. The drop/create just makes the
> algorithm easily implementable for now. In the future there can be some
> extra logic to only change the new view as far as it requires, by iterating
> over the two views and doing a merge-sort-ish type of thing. I think...
> Sounds like there should be a way at least.
>
> Once that works we can look at bolting on different fancy things, like
> having forking map/reduce mechanisms and my current pet idea of adding in
> the merge stuff that has been talked about.
>
> This is actually starting to sound like a fun little project....
>
> HTH,
> Paul Davis
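(As a rough illustration of that outline in Ruby (a sketch, not Paul's actual script; the polling interval, stage names, and doc shape are assumptions), a cron-style check might look like:)

    # Poll the stage-1 view's ETag; when it changes, rebuild the next
    # db in the flow and copy the grouped view rows into it as docs.
    # No error handling or paging; names match Paul's example above.
    require 'net/http'
    require 'json'

    def couch(req)
      Net::HTTP.start('localhost', 5984) { |http| http.request(req) }
    end

    last_etag = nil
    loop do
      etag = couch(Net::HTTP::Head.new('/base_db/_design/base/_view/stage-1'))['ETag']
      if etag && etag != last_etag
        # Drop and recreate the next db in the flow.
        couch(Net::HTTP::Delete.new('/base_db-stage-2'))
        couch(Net::HTTP::Put.new('/base_db-stage-2'))

        # Copy the grouped reduce output; CouchDB assigns fresh _ids
        # (see the docid discussion further down the thread).
        res = couch(Net::HTTP::Get.new('/base_db/_design/base/_view/stage-1?group=true'))
        docs = JSON.parse(res.body)['rows'].map do |row|
          { 'key' => row['key'], 'value' => row['value'] }
        end
        post = Net::HTTP::Post.new('/base_db-stage-2/_bulk_docs')
        post.body = JSON.generate('docs' => docs)
        post['Content-Type'] = 'application/json'
        couch(post)

        last_etag = etag
      end
      sleep 60
    end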
> Zachary Zolton wrote:
>> paul
>>
>> alright... you've gotta give me the remedial explanation of what you
>> meant here! (sorry, i'm still noob-ish)
>>
>> so, are you saying that i shouldn't even check for individual doc
>> updates, but instead just recreate the entire database? that sounds
>> like a job for cron, more so than the update notifier, right?
>>
>> i'd put up my current ruby script, but it deals with update notifications
>> in a way that's very specific to my data —probably very naïvely, to boot!
>>
>> zach
>>
>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis <[email protected]> wrote:
>>> Zachary,
>>>
>>> Awesome. The thing with non-incremental updates is that the basic
>>> algorithm would be to just look for updates to the view and, on update,
>>> delete the review DB, create a new one, and then dump the new data into
>>> it. I wouldn't try too hard to optimize updates at this point in time.
>>>
>>> Getting a ruby script out to show the basics should probably be the first
>>> step. Beyond that we'll have to take it a step at a time.
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> Zachary Zolton wrote:
>>>> @jchris et al,
>>>>
>>>> if you had any pointers on how to implement this, i have a strong
>>>> motivation to try my hand at it.
>>>>
>>>> i have a janky ruby script running as an update notifier that looks
>>>> for certain criteria, idiomatic to my data, and puts docs into a
>>>> derived database. but i'm not terribly happy with my current
>>>> implementation...
>>>>
>>>> is there a general-purpose algorithm for dealing with updates?
>>>>
>>>> cheers,
>>>> zach
>>>>
>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]> wrote:
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:
>>>>>> Hi Adam,
>>>>>>
>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>
>>>>>>> One comment about the map-only views: I think you'll find that Couch
>>>>>>> has already done a good bit of the work needed to support them, too.
>>>>>>> Couch maintains a btree for each design doc, keyed on docid, that
>>>>>>> stores all the view keys emitted by the maps over each document. When
>>>>>>> a document is updated and then analyzed, Couch has to consult that
>>>>>>> btree, purge all the KVs associated with the old version of the doc
>>>>>>> from each view, and then insert the new KVs. So the tracking
>>>>>>> information correlating docids and view keys is already available.
>>>>>>
>>>>>> See, I did not know that :-) Although I should have guessed.
>>>>>>
>>>>>> However, in the mail before this one I argued that it doesn't make
>>>>>> sense to combine or chain map-only views, since you can always write a
>>>>>> map function that does it in one step. Do you agree?
>>>>>>
>>>>>> You might also know the answer to this: is it possible to make the
>>>>>> Review DB be a sort of view index on the current database? All it
>>>>>> needs are JSON keys and values, no other fields.
>>>>>>
>>>>>>> You'd still be left with the problem of generating unique docids for
>>>>>>> the documents in the Review DB, but I think that's a problem that
>>>>>>> needs to be solved. The restriction to only MR views with no
>>>>>>> duplicate keys across views seems too strong to me.
>>>>>>
>>>>>> Well, since the Review DB is a local(*) hidden database that's handled
>>>>>> a bit specially, I think the easiest is to assign _id a sequence
>>>>>> number and create a default view that indexes the documents by doc.key
>>>>>> (for updating the value for that key). There will never be contention
>>>>>> and we're only interested in the key index.
>>>>>
>>>>> We discussed this a little at CouchHack and I argued that the simplest
>>>>> solution is actually good for a few reasons.
>>>>>
>>>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>>>> reduce function to a new database.
>>>>>
>>>>> Good because it is most like Hadoop/Google-style map/reduce. In that
>>>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>>>> persisted in a way that allows for multiple later reduce stages to be
>>>>> run on it. It's common in Hadoop to chain many m/r stages, and to try a
>>>>> few iterations of each stage while developing code.
>>>>>
>>>>> I like this also because it provides the needed functionality without
>>>>> adding any new primitives to CouchDB.
>>>>>
>>>>> The only downside of this approach is that it is not incremental. I'm
>>>>> not sure that incremental chainability has much promise, as the index
>>>>> management could be a pain, especially if you have branching chains.
>>>>>
>>>>> Another upside is that by reducing to a db, you give the user power to
>>>>> do things like use replication to merge multiple data sets before
>>>>> applying more views.
>>>>>
>>>>> I don't want to discourage anyone from experimenting with code, just
>>>>> want to point out this simple solution, which would be Very Easy to
>>>>> implement.
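(To make the chaining concrete: once the grouped rows land in the new db as key/value docs, as in the Ruby sketch upthread, the next stage is just an ordinary design doc on that db. For instance, a sort-by-value stage could be nothing more than the following; the doc shape and names are a guess, not an agreed format:)

    {
        "_id": "_design/stage2",
        "views": {
            "by_value": {
                "map": "function(doc) { emit(doc.value, doc.key); }"
            }
        }
    }

(No new primitives needed: the copy step is plain HTTP, and replication can merge several such dbs before this view runs, which is the upside Chris mentions.)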
>>>>>> (*)local: I'm assuming that views are not replicated and need to be
>>>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>>>> think it would still work, but we'd have to look at it a little more.
>>>>>>
>>>>>>> With that said, I'd prefer to spend my time extending the view engine
>>>>>>> to handle chainable MR workflows in a single shot. Especially in the
>>>>>>> simple sort_by_value case it just seems like a cleaner way to go
>>>>>>> about things.
>>>>>>
>>>>>> Yes, that seems to be the gist of all the replies, and I agree :-)
>>>>>>
>>>>>> In a nutshell, I'm hoping that:
>>>>>> * A review is a new sort of view that has an "inputs" array in its
>>>>>> definition.
>>>>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>>>>> * It builds a persistent index of the incoming views when those get
>>>>>> updated.
>>>>>> * That index is then used to build the view index for the review when
>>>>>> the review gets updated.
>>>>>> * I think I covered the most important algorithms needed to implement
>>>>>> this in my original proposal.
>>>>>>
>>>>>> Does this sound feasible? If so, I'll update my proposal accordingly.
>>>>>>
>>>>>> Wout.
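(None of the following syntax exists; it is only a sketch of what Wout's bullet points might turn into as a design doc, with the "inputs" array naming the upstream MR views and every name here invented for illustration:)

    {
        "_id": "_design/report",
        "reviews": {
            "totals_by_value": {
                "inputs": ["base/stage-1"],
                "map": "function(key, value) { emit(value, key); }",
                "reduce": "function(keys, values, rereduce) ..."
            }
        }
    }

(The persistent index of the inputs would then play the role the copied db plays in Chris's scheme, but updated incrementally.)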
