Perhaps the background job could maintain a version number in the base
DB's design doc, which clients could use to know which version of the
derived database to hit.

See? ;^) This might be simpler, as a 1st-class CouchDB feature, than a
bolted-on script.
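
Something like this, maybe (rough sketch; "derived_version" is a member
name I just made up, and I'm assuming plain net/http against a local
couch):

require 'net/http'
require 'json'

couch = URI('http://127.0.0.1:5984')

# client side: read the version stamp the background job maintains in
# the base db's design doc, then hit the matching derived database
ddoc = JSON.parse(Net::HTTP.get(couch + '/base_db/_design/base'))
derived = "base_db-stage-1-#{ddoc['derived_version']}"
puts Net::HTTP.get(couch + "/#{derived}/_all_docs?limit=10")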

On Mon, Apr 27, 2009 at 2:48 PM, Paul Davis <[email protected]> wrote:
> Zachary,
>
> Hmm. Naming your derived databases as base_db-stage-etag doesn't sound like
> a bad idea. Though I dunno how you'd communicate to clients that they should
> start hitting the new versions, and it also doesn't tell the admins when to
> drop old indices.
>
> The only thing that comes to mind is to stick some intermediary in between
> clients and the actual derived data, to make the switch transparent and to
> let you know when you can clean up old versions, etc.
>
> I'll keep thinking on it.
>
> Paul
>
> Zachary Zolton wrote:
>>
>> okay, i'm starting to get ya. my question is, if i'm constantly
>> dropping/recreating/reindexing the derived database, how can i keep
>> serving requests from my website?
>>
>> one possible solution would be to time/etag/etc-stamp the derived db
>> name, but that would seem to add a number of moving parts to my
>> system.
>>
>> hmm... any ideas of how to pull a quick switcheroo on the backend of
>> my system, without too much hassle in the client code?
>>
>> On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <[email protected]>
>> wrote:
>>
>>>
>>> Zachary,
>>>
>>> No worries, the rough outline I'd do here is something like:
>>>
>>> 1. Figure out some member structure in the _design document that will
>>> represent your data flow. For the moment I would do something extremely
>>> simple, as in:
>>>
>>> Assume:
>>> db_name = "base_db"
>>>
>>> {
>>>   "_id": "_design/base",
>>>   "views": {
>>>     "stage-1": {
>>>       "map": "function(doc) ...",
>>>       "reduce": "function(keys, values, rereduce) ..."
>>>     }
>>>   },
>>>   "review": [
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."},
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."}
>>>   ]
>>> }
>>>
>>> So the "review" member becomes the stages in your data flow. I'm avoiding
>>> any forking or merging in this example in honor of the "make it work, make
>>> it not suck" development flow.
>>>
>>> Now the basic algorithm would be something like:
>>>
>>> For each array element in the "review" member, create a db, something like:
>>>
>>> base_db-stage-1 with a design document that contains a view built from the
>>> first element of the "review" member, base_db-stage-2 with the second
>>> element, and so on.
>>>
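>>> In ruby, that setup step might look something like this (rough sketch,
>>> plain net/http against a local couch, no error handling):
>>>
>>> require 'net/http'
>>> require 'json'
>>>
>>> couch = Net::HTTP.new('127.0.0.1', 5984)
>>> ddoc = JSON.parse(couch.get('/base_db/_design/base').body)
>>>
>>> ddoc['review'].each_with_index do |stage, i|
>>>   db = "base_db-stage-#{i + 1}"
>>>   couch.put("/#{db}", '')  # create the db; couch answers 412 if it exists
>>>   couch.put("/#{db}/_design/base",
>>>             { '_id' => '_design/base',
>>>               'views' => { 'stage' => stage } }.to_json)
>>> end
>>>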
>>> Then your script can check the view status in each database, either from
>>> cron or from an update_notifier. To do so, you can just:
>>>
>>> HEAD /base_db/_design/base/_view/stage-1
>>>
>>> and then check the returned ETag. For the moment this is exactly equivalent
>>> to checking the database's update_seq, because of how the ETag is
>>> calculated, but in the future, when we track the last update_seq for each
>>> view change, this will be a free upgrade. Plus it's a bit more logical to
>>> check "view state" instead of "db state".
>>>
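>>> In ruby, that check boils down to a HEAD request and a string compare
>>> (sketch; the ETag is opaque, and where you stash the last-seen value is
>>> up to you; here I'm just using a file):
>>>
>>> require 'net/http'
>>>
>>> couch = Net::HTTP.new('127.0.0.1', 5984)
>>> etag = couch.head('/base_db/_design/base/_view/stage-1')['etag']
>>>
>>> last = File.exist?('stage-1.etag') ? File.read('stage-1.etag') : nil
>>> if etag != last
>>>   File.write('stage-1.etag', etag.to_s)
>>>   puts 'stage-1 changed; rebuild base_db-stage-2'
>>> end
>>>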
>>> When the ETags don't match, you can just drop the next db in the flow,
>>> create it, and then copy the view output over. The drop/create just makes
>>> the algorithm easy to implement for now. In the future there could be some
>>> extra logic to update the new view only as far as it needs to go, by
>>> iterating over the two views and doing a merge-sort-ish type of thing. I
>>> think... Sounds like there should be a way, at least.
>>>
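>>> A rough cut of that drop/create/copy step, letting couch assign the _ids
>>> in the new db (sketch, no error handling):
>>>
>>> require 'net/http'
>>> require 'json'
>>>
>>> couch = Net::HTTP.new('127.0.0.1', 5984)
>>>
>>> # grab the grouped reduce rows from the current stage
>>> view = JSON.parse(
>>>   couch.get('/base_db/_design/base/_view/stage-1?group=true').body)
>>>
>>> # drop and recreate the next db in the flow
>>> couch.delete('/base_db-stage-2')
>>> couch.put('/base_db-stage-2', '')
>>> # (re-PUT the stage-2 design doc here before the next hop in the chain)
>>>
>>> # copy the view output in as plain docs, one per key
>>> docs = view['rows'].map { |r| { 'key' => r['key'], 'value' => r['value'] } }
>>> couch.post('/base_db-stage-2/_bulk_docs', { 'docs' => docs }.to_json,
>>>            'Content-Type' => 'application/json')
>>>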
>>> Once that works, we can look at bolting on different fancy things, like
>>> forking map/reduce mechanisms and my current pet idea of adding in the
>>> merge stuff that has been talked about.
>>>
>>> This is actually starting to sound like a fun little project....
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> Zachary Zolton wrote:
>>>
>>>>
>>>> paul
>>>>
>>>> alright... you've gotta give me the remedial explanation of what you
>>>> meant here! (sorry, i'm still noob-ish)
>>>>
>>>> so, are you saying that i shouldn't even check for individual doc
>>>> updates, but instead just recreate the entire database? that sounds
>>>> like a job for cron, more so than the update notifier, right?
>>>>
>>>> i'd put up my current ruby script, but it deals with update notifications
>>>> in a way that's very specific to my data (probably very naïvely, to boot!)
>>>>
>>>> zach
>>>>
>>>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
>>>> <[email protected]> wrote:
>>>>>
>>>>> Zachary,
>>>>>
>>>>> Awesome. The thing with non-incremental updates is that the basic
>>>>> algorithm would be to just look for updates to the view and, on update,
>>>>> delete the review DB, create a new one, and then dump the new data into
>>>>> it. I wouldn't try too hard to optimize updates at this point in time.
>>>>>
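>>>>> If you hang it off an update_notifier instead of cron, couch feeds the
>>>>> script one JSON object per line on stdin, so the outer loop is tiny
>>>>> (sketch; rebuild logic elided, and that's the notification shape as I
>>>>> remember it):
>>>>>
>>>>> require 'json'
>>>>>
>>>>> STDIN.each_line do |line|
>>>>>   note = JSON.parse(line)
>>>>>   next unless note['type'] == 'updated' && note['db'] == 'base_db'
>>>>>   # here: drop the review db, recreate it, dump the view output back in
>>>>>   warn 'base_db updated; rebuilding review db'
>>>>> end
>>>>>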
>>>>> Getting a ruby script out to show the basics should probably be the
>>>>> first step. Beyond that we'll have to take it a step at a time.
>>>>>
>>>>> HTH,
>>>>> Paul Davis
>>>>>
>>>>>
>>>>> Zachary Zolton wrote:
>>>>>>
>>>>>> @jchris et al,
>>>>>>
>>>>>> if you had any pointers on how to implement this, i have a strong
>>>>>> motivation to try my hand at it.
>>>>>>
>>>>>> i have a janky ruby script running as an update notifier that looks
>>>>>> for certain criteria, idiomatic to my data, and puts docs into a
>>>>>> derived database. but i'm not terribly happy with my current
>>>>>> implementation...
>>>>>>
>>>>>> is there a general-purpose algorithm for dealing with updates?
>>>>>>
>>>>>>
>>>>>> cheers,
>>>>>>
>>>>>> zach
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>>>>
>>>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>>>
>>>>>>>>> One comment about the map-only views: I think you'll find that Couch
>>>>>>>>> has already done a good bit of the work needed to support them, too.
>>>>>>>>> Couch maintains a btree for each design doc, keyed on docid, that
>>>>>>>>> stores all the view keys emitted by the maps over each document. When
>>>>>>>>> a document is updated and then analyzed, Couch has to consult that
>>>>>>>>> btree, purge all the KVs associated with the old version of the doc
>>>>>>>>> from each view, and then insert the new KVs. So the tracking
>>>>>>>>> information correlating docids and view keys is already available.
>>>>>>>>
>>>>>>>> See I did not know that :-) Although I should have guessed.
>>>>>>>>
>>>>>>>> However, in the mail before this one I argued that it doesn't make
>>>>>>>> sense to combine or chain map-only views, since you can always write a
>>>>>>>> map function that does it in one step. Do you agree?
>>>>>>>>
>>>>>>>> You might also know the answer to this: is it possible to make the
>>>>>>>> Review DB be a sort of view index on the current database? All it
>>>>>>>> needs are JSON keys and values, no other fields.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You'd still be left with the problem of generating unique docids for
>>>>>>>>> the documents in the Review DB, but I think that's a problem that
>>>>>>>>> needs to be solved. The restriction to only MR views with no
>>>>>>>>> duplicate keys across views seems too strong to me.
>>>>>>>>
>>>>>>>> Well, since the Review DB is a local(*) hidden database that's handled
>>>>>>>> a bit specially, I think the easiest is to assign _id a sequence
>>>>>>>> number and create a default view that indexes the documents by doc.key
>>>>>>>> (for updating the value for that key). There will never be contention
>>>>>>>> and we're only interested in the key index.
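>>>>>>>>
>>>>>>>> To make the shape concrete, something like this (ruby sketch with
>>>>>>>> made-up rows, just to illustrate):
>>>>>>>>
>>>>>>>> require 'json'
>>>>>>>>
>>>>>>>> seq = 0
>>>>>>>> rows = [['2009-04', 10], ['2009-05', 12]]  # pretend reduce output
>>>>>>>> docs = rows.map do |key, value|
>>>>>>>>   { '_id' => format('%010d', seq += 1),    # sequence number as _id
>>>>>>>>     'key' => key, 'value' => value }
>>>>>>>> end
>>>>>>>>
>>>>>>>> # the default view, so we can find the doc for a given key later
>>>>>>>> ddoc = { '_id' => '_design/review', 'views' => { 'by_key' =>
>>>>>>>>   { 'map' => 'function(doc) { emit(doc.key, null); }' } } }
>>>>>>>>
>>>>>>>> puts JSON.pretty_generate('docs' => docs + [ddoc])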
>>>>>>>
>>>>>>> We discussed this a little at CouchHack and I argued that the simplest
>>>>>>> solution is actually good for a few reasons.
>>>>>>>
>>>>>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>>>>>> reduce function to a new database.
>>>>>>>
>>>>>>> Good because it is most like Hadoop/Google-style map/reduce. In that
>>>>>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>>>>>> persisted in a way that allows for multiple later reduce stages to be
>>>>>>> run on it. It's common in Hadoop to chain many m/r stages, and to try a
>>>>>>> few iterations of each stage while developing code.
>>>>>>>
>>>>>>> I like this also because it provides the needed functionality without
>>>>>>> adding any new primitives to CouchDB.
>>>>>>>
>>>>>>> The only downside of this approach is that it is not incremental. I'm
>>>>>>> not sure that incremental chainability has much promise, as the index
>>>>>>> management could be a pain, especially if you have branching chains.
>>>>>>>
>>>>>>> Another upside is that by reducing to a db, you give the user power to
>>>>>>> do things like use replication to merge multiple data sets before
>>>>>>> applying more views.
>>>>>>>
>>>>>>> I don't want to discourage anyone from experimenting with code; I just
>>>>>>> want to point out this simple solution, which would be Very Easy to
>>>>>>> implement.
>>>>>>>>
>>>>>>>> (*) local: I'm assuming that views are not replicated and need to be
>>>>>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>>>>>> think it would still work, but we'd have to look at it a little more.
>>>>>>>>>
>>>>>>>>> With that said, I'd prefer to spend my time extending the view engine
>>>>>>>>> to handle chainable MR workflows in a single shot. Especially in the
>>>>>>>>> simple sort_by_value case it just seems like a cleaner way to go
>>>>>>>>> about things.
>>>>>>>>
>>>>>>>> Yes, that seems to be the gist of all repliers and I agree :-)
>>>>>>>>
>>>>>>>> In a nutshell, I'm hoping that:
>>>>>>>> * A review is a new sort of view that has an "inputs" array in its
>>>>>>>> definition (strawman sketch below).
>>>>>>>> * Only MR views are allowed as inputs, with no KV duplication allowed.
>>>>>>>> * It builds a persistent index of the incoming views when those get
>>>>>>>> updated.
>>>>>>>> * That index is then used to build the view index for the review when
>>>>>>>> the review gets updated.
>>>>>>>> * I think I covered the most important algorithms needed to implement
>>>>>>>> this in my original proposal.
>>>>>>>>
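>>>>>>>> As a strawman for that definition syntax (purely hypothetical, nothing
>>>>>>>> like this exists yet; "sort_by_value" echoes Adam's example):
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "_id": "_design/report",
>>>>>>>>   "reviews": {
>>>>>>>>     "sort_by_value": {
>>>>>>>>       "inputs": ["base/stage-1"],
>>>>>>>>       "map": "function(key, value) { emit(value, key); }"
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>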
>>>>>>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>>>>>>
>>>>>>>> Wout.