Re: Proposal: Review DBs

Adam Kocoloski Wed, 22 Apr 2009 11:32:04 -0700

Hi Zachary, something like that. The more I think about the problem, the more I converge on a solution like what Wout has proposed. Some quick thoughts:

* Remapping the output of a map isn't terribly useful. All the power is in remapping the output of a reduction.

* Incremental generation of some complex multi-MR view still requires persisting the output of each step individually, even if you're only interested in the final result. At least, I don't yet see a clever way around it.

* Dumping the output of the first MR(s) into a Review DB is an easy way to take advantage of code that's already written, but it's a bit wasteful. We could just take the results directly from the view btree(s) and send them to the next step in the workflow.

* I'm not yet sold on the HTTP API in Wout's proposal. I think I'd prefer to keep the existing API, and in the _design doc specify the full workflow required to generate a given view.
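
To make the chaining idea concrete, here is a minimal Python sketch of a workflow given as a list of map/reduce pairs, where each step consumes the rows of the previous step directly instead of going through an intermediate database. This is not CouchDB code: `map_reduce`, `run_workflow` and the sample documents are hypothetical, and everything is recomputed from scratch, so it does not address the incremental-update problem in the second bullet.

```python
from collections import defaultdict


def map_reduce(rows, map_fn, reduce_fn=None):
    """Run one map step (and an optional reduce) over (key, value) rows.

    Rows come back sorted by key, the way a view would return them.
    """
    emitted = []
    for key, value in rows:
        emitted.extend(map_fn(key, value))
    if reduce_fn is None:
        return sorted(emitted)
    grouped = defaultdict(list)
    for k, v in emitted:
        grouped[k].append(v)
    return sorted((k, reduce_fn(vs)) for k, vs in grouped.items())


def run_workflow(rows, steps):
    """Feed the output rows of each step straight into the next step."""
    for map_fn, reduce_fn in steps:
        rows = map_reduce(rows, map_fn, reduce_fn)
    return rows


# Step 1: count how often each tag is used.
count_tags = (lambda doc_id, doc: [(tag, 1) for tag in doc["tags"]], sum)
# Step 2: remap the reduction, counting how many tags share each usage count.
tags_per_count = (lambda tag, count: [(count, 1)], sum)

docs = [
    ("a", {"tags": ["couchdb", "erlang"]}),
    ("b", {"tags": ["couchdb"]}),
    ("c", {"tags": ["couchdb", "http"]}),
]
result = run_workflow(docs, [count_tags, tags_per_count])
print(result)
```

The second step only makes sense on the reduced rows of the first, which is the "remapping the output of a reduction" case from the first bullet.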


Cheers, Adam

On Apr 22, 2009, at 10:53 AM, Zachary Zolton wrote:

Such as having the view definition in the design doc contain an array of objects, each with the map/reduce function pair as attributes?

On Wed, Apr 22, 2009 at 9:48 AM, Adam Kocoloski wrote:
Hi Wout, thanks for writing this up.

One comment about the map-only views: I think you'll find that Couch has already done a good bit of the work needed to support them, too. Couch maintains a btree for each design doc, keyed on docid, that stores all the view keys emitted by the maps over each document. When a document is updated and then analyzed, Couch has to consult that btree, purge all the KVs associated with the old version of the doc from each view, and then insert the new KVs. So the tracking information correlating docids and view keys is already available.
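
The purge-then-insert bookkeeping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not CouchDB's implementation: the `ViewIndex` class is hypothetical, plain dicts stand in for the btrees, it covers a single view rather than every view in a design doc, and a document that emits the same key twice keeps only the last value.

```python
class ViewIndex:
    """Toy stand-in for one view plus the docid-keyed tracking structure."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = {}           # (key, docid) -> value: the view itself
        self.keys_by_docid = {}  # docid -> set of keys that document emitted

    def update(self, docid, doc):
        """Index a new version of a document; pass doc=None for a deletion."""
        # Purge the KVs emitted by the previous version of the document.
        for key in self.keys_by_docid.pop(docid, set()):
            del self.rows[(key, docid)]
        if doc is None:
            return
        # Insert the KVs emitted by the new version and remember their keys.
        emitted = list(self.map_fn(doc))
        for key, value in emitted:
            self.rows[(key, docid)] = value
        self.keys_by_docid[docid] = {key for key, _ in emitted}


index = ViewIndex(lambda doc: [(tag, 1) for tag in doc["tags"]])
index.update("post1", {"tags": ["a", "b"]})
index.update("post1", {"tags": ["b", "c"]})  # the old "a" row is purged
print(sorted(index.rows))
```

Because `keys_by_docid` already records which rows each document produced, the old version of the document is never needed to find the rows to remove.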

You'd still be left with the problem of generating unique docids for the documents in the Review DB, but I think that's a problem that needs to be solved. The restriction to only MR views with no duplicate keys across views seems too strong to me.

With that said, I'd prefer to spend my time extending the view engine to handle chainable MR workflows in a single shot. Especially in the simple sort_by_value case it just seems like a cleaner way to go about things.

 Cheers,

Adam

On Apr 22, 2009, at 8:40 AM, Wout Mertens wrote:
Intro
=====
How do you sort by reduce value? How do you join views? How do you get unique view results? How do you cache group key reduces?

I think that with the solution proposed below, all of the above and more are possible. The general idea is to store view results and run map/reduce on them. There have been some discussions about this but they went nowhere. I've been thinking about this issue a bit and I think it can be done.

I'd like to call this feature a Review DB.

Use cases
=========
- Suppose you want to know what tags are most popular on your blog. Simply get:


 http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
Where tags_by_count is a Review DB that gets input from the tagcount view and then runs the sort_by_value view on it, a map() function that simply emits (value,key).

Likewise, show pages in order of popularity, whereby users can vote up (+1) or down (-1):

 http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
- Given documents with attributes title, date and tags, you'd like to know the minimum value of date and a breakdown by count for tags, for every title. Normally you'd use 2 map+reduce views, minimum_date_by_title and tagcount_by_title, which you would then query separately. With a Review DB, you can let both views insert their results in the database and then run a view that combines the results in one view:


 http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
- This is not a way to run an on-the-fly map/reduce on a subset of a view, like if you want to find the median popularity score of restaurants with "Tony" in their name that are close to you.

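As an illustration of the page-votes use case above, the following Python sketch sums the votes per page and then applies a sort_by_value step that emits (value, key). The vote data and the helper names `reduce_sum` and `sort_by_value` are made up for the example; in the proposal the first step would be an ordinary map+reduce view and the second a map() view on the Review DB.

```python
from collections import defaultdict

# Hypothetical output of the map step: one (page, vote) row per vote.
votes = [
    ("HomePage", 1),
    ("HomePage", 1),
    ("Sandbox", -1),
    ("FAQ", 1),
]


def reduce_sum(rows):
    """Grouped reduce: total the values for each key."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return sorted(totals.items())


def sort_by_value(rows):
    """map() over reduced rows: emit (value, key), sorted by the new key."""
    return sorted((value, key) for key, value in rows)


pagevotes = reduce_sum(votes)             # what the Review DB would cache
by_popularity = sort_by_value(pagevotes)  # what the chained view returns
print(by_popularity)
```

Reading `by_popularity` in descending order gives the most popular pages first.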
Implementation
==============
A Review DB is a hidden database maintained by CouchDB with these fields:
- _id of document is the string representation of the key
- "key" is the key of the incoming view row (unique)
- "value" is the value of the incoming view row

I hope that this is sufficiently like a normal view that it can be stored as a normal view. _id is just there to make it doc-compliant; it would be much better if "key" were the actual key.

A Review DB is defined in a design document like normal views. Each review is an entry in the "reviews" hash, and has an "incoming_views" array that lists all the views that should insert results in the review db plus the group level, as well as a normal "views" hash for further map/reduce of the review db (and perhaps another "reviews" hash for further result processing?).
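
One possible shape for such a design document is sketched below as a Python dict that serializes to JSON. Only "reviews", "incoming_views" and the nested "views" hash come from the description above; the "view" and "group_level" field names inside each incoming_views entry, and the function bodies, are assumptions made for illustration.

```python
import json

design_doc = {
    "_id": "_design/myblog",
    # Ordinary map+reduce view that feeds the Review DB.
    "views": {
        "tagcount": {
            "map": "function(doc) { for (var i in doc.tags) emit(doc.tags[i], 1); }",
            "reduce": "function(keys, values) { return sum(values); }",
        }
    },
    "reviews": {
        "tags_by_count": {
            # Entry layout is hypothetical: which view to read, at which group level.
            "incoming_views": [
                {"view": "tagcount", "group_level": 1}
            ],
            # Views that run over the Review DB documents (key/value fields).
            "views": {
                "sort_by_value": {
                    "map": "function(doc) { emit(doc.value, doc.key); }"
                }
            },
        }
    },
}

encoded = json.dumps(design_doc, indent=2)
print(encoded)
```

With this layout, the tags_by_count/sort_by_value pair corresponds to the first use-case URL.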

Maintaining a database of results means that results have to be updated or even removed when documents change. I tried to make this work (in theory) for map-only views, but the resulting requirements are quite messy. You either need to cache the previous results of a view for each document, or you have to have an old version of the document available to regenerate those results.

Therefore, a Review DB only accepts results from one or more map+reduce views. You define beforehand what the group_level of the keys is that will be inserted.

Furthermore, a Review DB disallows (but doesn't enforce) having 2 views that generate the same keys. Otherwise, refcounting would need to be used and while that's not difficult, I think there's limited value in allowing this.

The Review DB needs updating every time the reduction for a group key of one of the participating views gets updated. Even though a map+reduce view has unique keys, we need a refcount since we have multiple views. Whoever got to insert its value last wins.

There is a slight complication: group key values are calculated on-the-fly from the view result b-tree. So whenever a reduce call results in a new value for a b-tree node, AND that node is the upper node of a subtree that is completely part of a group key, that group key needs to be marked for recalculation.

Likewise, if deletion/addition of a b-tree node results in the removal/creation of the sole upper node of a group key subtree, that group key needs to be marked for removal/addition.

This is the algorithm:
- When a reducing view gets updated, and it is part of a Review DB, use the 2 paragraphs above to keep a list of group keys that need handling
- After updating the reduce() results, for each of the marked group keys:
  - If a group key gets removed:
    - look up doc with key=group key in review db. If exists:
      - delete doc
  - If a group key gets added:
    - look up doc with key=group key in review db. If exists:
      - set doc.value to the row value
    - else
      - create doc with id=group key in string form, key=group key, value=value
  - If a group key gets updated:
    - look up doc with key=group key in review db. If exists:
      - set doc.value to the row value
    - else
      - create doc with id=group key in string form, key=group key, value=value
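
The per-group-key steps above can be sketched in Python as follows. This is only an illustration under assumptions: the Review DB is modelled as a plain dict keyed by _id, `apply_group_key_changes` and the action names are hypothetical, JSON encoding is assumed as the "string form" of a key, and the lookup "by key" is done through the _id derived from that key.

```python
import json


def apply_group_key_changes(review_db, changes):
    """Apply marked group-key changes to a Review DB.

    review_db: dict mapping _id -> {"_id": ..., "key": ..., "value": ...}
    changes:   iterable of (action, group_key, value) tuples, where action
               is "removed", "added" or "updated"
    """
    for action, group_key, value in changes:
        doc_id = json.dumps(group_key)  # assumed string form of the key
        if action == "removed":
            # Delete the doc if it exists; otherwise there is nothing to do.
            review_db.pop(doc_id, None)
        elif action in ("added", "updated"):
            doc = review_db.get(doc_id)
            if doc is not None:
                doc["value"] = value  # whoever inserts last wins
            else:
                review_db[doc_id] = {"_id": doc_id, "key": group_key, "value": value}
        else:
            raise ValueError("unknown action: %r" % (action,))
    return review_db


review_db = {}
apply_group_key_changes(review_db, [
    ("added", "couchdb", 3),
    ("added", "erlang", 1),
    ("updated", "couchdb", 4),
    ("removed", "erlang", None),
])
print(review_db)
```

Additions and updates end up sharing one branch, since the list prescribes the same steps for both.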

As you can see, this is something CouchDB should do since it knows when it's updating group key reduction values and it knows if this was a delete, update or addition.

View updates are done when the view is called; Review updates are done at this time as well. Views on Review DBs are done when they are called.

Summary
=======
Review DBs are a sort of view index that CouchDB can maintain with little overhead. They cache group key results and allow chained map+reduce calculations using mostly existing frameworks.

I think this would be a very useful feature for CouchDB to have. There are regular requests on the mailing lists for storing view results in a database for post-processing.

I'm not saying this is a trivial change but it doesn't seem technically impossible to me either. (Unless I missed something again; this is the 5th iteration of this proposal. Anyway I know *I* wouldn't be able to code this :-) )

What do you think, oh dear devs?

Wout.
