@jchris et al, if you have any pointers on how to implement this, I have a strong motivation to try my hand at it.
I have a janky ruby script running as an update notifier that looks for certain criteria, idiomatic to my data, and puts matching docs into a derived database. But I'm not terribly happy with my current implementation... Is there a general-purpose algorithm for dealing with updates?

cheers, zach

On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <[email protected]> wrote:
>
>
> Sent from my iPhone
>
> On Apr 26, 2009, at 2:26 PM, Wout Mertens <[email protected]> wrote:
>
>> Hi Adam,
>>
>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>
>>> Hi Wout, thanks for writing this up.
>>>
>>> One comment about the map-only views: I think you'll find that Couch has already done a good bit of the work needed to support them, too. Couch maintains a btree for each design doc keyed on docid that stores all the view keys emitted by the maps over each document. When a document is updated and then analyzed, Couch has to consult that btree, purge all the KVs associated with the old version of the doc from each view, and then insert the new KVs. So the tracking information correlating docids and view keys is already available.
>>
>> See, I did not know that :-) Although I should have guessed.
>>
>> However, in the mail before this one I argued that it doesn't make sense to combine or chain map-only views, since you can always write a map function that does it in one step. Do you agree?
>>
>> You might also know the answer to this: is it possible to make the Review DB be a sort of view index on the current database? All it needs are JSON keys and values, no other fields.
>>
>>> You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.
>>
>> Well, since the Review DB is a local(*) hidden database that's handled a bit specially, I think the easiest is to assign _id a sequence number and create a default view that indexes the documents by doc.key (for updating the value for that key). There will never be contention and we're only interested in the key index.
>
> We discussed this a little at CouchHack and I argued that the simplest solution is actually good for a few reasons.
>
> The simple solution: provide a mechanism to copy the rows of a grouped reduce function to a new database.
>
> Good because it is most like Hadoop/Google style map reduce. In that paradigm, the output of a map/reduce job is not incremental, and it is persisted in a way that allows for multiple later reduce stages to be run on it. It's common in Hadoop to chain many m/r stages, and to try a few iterations of each stage while developing code.
>
> I like this also because it provides the needed functionality without adding any new primitives to CouchDB.
>
> The only downside of this approach is that it is not incremental. I'm not sure that incremental chainability has much promise, as the index management could be a pain, especially if you have branching chains.
>
> Another upside is that by reducing to a db, you give the user power to do things like use replication to merge multiple data sets before applying more views.
>
> I don't want to discourage anyone from experimenting with code, just want to point out this simple solution, which would be Very Easy to implement.
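Chris: to make sure I'm reading the "simple solution" right, here is roughly what I picture it looking like from the outside, driven by hand over HTTP for now. All the names (a "logs" source db, a "stats" design doc with a "by_user" view, a "stats_rollup" target db) are made up, and this is an untested sketch rather than a patch:

  require 'net/http'
  require 'json'
  require 'uri'

  couch = 'http://127.0.0.1:5984'

  # 1. Pull the grouped reduce rows from the source view.
  view_uri = URI("#{couch}/logs/_design/stats/_view/by_user?group=true")
  rows = JSON.parse(Net::HTTP.get(view_uri))['rows']

  # 2. Turn each {key, value} row into a plain doc. Using the key as the
  #    _id keeps one doc per key, but a re-run would need the existing
  #    _revs to update rather than conflict -- glossed over here.
  docs = rows.map do |row|
    { '_id' => row['key'].to_s, 'key' => row['key'], 'value' => row['value'] }
  end

  # 3. Bulk-load the docs into the target db, which can then carry its
  #    own design docs for the next map/reduce stage.
  target = URI("#{couch}/stats_rollup/_bulk_docs")
  post = Net::HTTP::Post.new(target.path, 'Content-Type' => 'application/json')
  post.body = JSON.generate('docs' => docs)
  Net::HTTP.start(target.host, target.port) { |http| http.request(post) }

If that's the idea, the built-in mechanism would essentially do steps 1-3 server-side, and the target db is then free to carry its own design docs for further map/reduce passes, or be replicated and merged first as you say.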
>
>>
>> (*)local: I'm assuming that views are not replicated and need to be recalculated for each CouchDB node. If they are replicated somehow, I think it would still work but we'd have to look at it a little more.
>>
>>> With that said, I'd prefer to spend my time extending the view engine to handle chainable MR workflows in a single shot. Especially in the simple sort_by_value case it just seems like a cleaner way to go about things.
>>
>> Yes, that seems to be the gist of all repliers and I agree :-)
>>
>> In a nutshell, I'm hoping that:
>> * A review is a new sort of view that has an "inputs" array in its definition.
>> * Only MR views are allowed as inputs, no KV duplication allowed.
>> * It builds a persistent index of the incoming views when those get updated.
>> * That index is then used to build the view index for the review when the review gets updated.
>> * I think I covered the most important algorithms needed to implement this in my original proposal.
>>
>> Does this sound feasible? If so I'll update my proposal accordingly.
>>
>> Wout.
>
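And Wout, just to check that I follow the nutshell above: would the review definition end up living in a design doc, something like the hash below (written as what my ruby script would PUT)? Every field name here is my guess at a shape, not something taken from your proposal:

  review_ddoc = {
    '_id'     => '_design/tag_stats',
    'reviews' => {
      'by_count' => {
        # guessed field: the MR views whose grouped rows feed this review
        'inputs' => ['analytics/tag_counts'],
        # guessed fields: map/reduce applied to the {key, value} rows of the inputs,
        # here flipping key and value for the sort_by_value case
        'map'    => 'function(key, value) { emit(value, key); }',
        'reduce' => '_count'
      }
    }
  }

If that's roughly the shape, I'd be glad to start by prototyping the persistent index of the input views and report back.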
