Proposal: Review DBs

Wout Mertens Wed, 22 Apr 2009 05:40:49 -0700

Intro
=====

How do you sort by reduce value? How do you join views? How do you getunique view results? How do you cache group key reduces?

I think that with the below proposed solution all the above and moreare possible. The general idea is to store view results and run map/reduce on them. There's been some discussions about this but they wentnowhere. I've been thinking about this issue a bit and I think it canbe done.


I'd like to call this feature a Review DB.

Use cases
=========

- Suppose you want to know what tags are most popular on your blog.Simply get:


  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value

Where tags_by_count is a Review DB that gets input from the tagcountview and then runs the sort_by_value view on it, a map() function thatsimply emits (value,key).

Likewise, show pages in order of popularity, whereby user can vote up(+1) or down (-1):

http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value

- Given documents with attributes title, date and tags. You'd like toknow the minimum value of date and a breakdown by count for tags, forevery title. Normally you'd use 2 map+reduce views,minimum_date_by_title and tagcount_by_title, which you would thenquery separately. With a Review DB, you can let both views inserttheir results in the database and then run a view that combines theresults in one view:


  
http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data

- This is not a way to run an on-the-fly map/reduce on a subset of aview, like if you want to find the median popularity score ofrestaurants with "Tony" in their name that are close to you.


Implementation
==============

A Review DB is a hidden database maintained by CouchDB with thesefields:

- _id of document is the string representation of the key
- "key" is the key of the incoming view row (unique)
- "value" is the value of the incoming view row

I hope that this is sufficiently like a normal view that it can bestored as a normal view. _id is just there to make it doc-compliant,it would be much better if "key" were the actual key.

A Review DB is defined in a design document like normal views. Eachreview is an entry in the "reviews" hash, and has a "incoming_views"array that lists all the views that should insert results in thereview db plus the group level, as well as a normal "views" hash forfurther map/reduce of the review db (and perhaps another "reviews"hash for further result processing?).

Maintaining a database of results means that results have to beupdated or even removed when documents change. I tried to make thiswork (in theory) for map-only views, but the resulting requirementsare quite messy. You either need to cache the previous results of aview for each document, or you have to have an old version of thedocument available to regenerate those results.

Therefore, a Review DB only accepts results from one or more map+reduce views. You define beforehand what the group_level of the keysis that will be inserted.

Furthermore, a Review DB disallows (but doesn't enforce) having 2views that generate the same keys. Otherwise, refcounting would needto be used and while that's not difficult, I think there's limitedvalue in allowing this.

The Review DB needs updating every time the reduction for a group keyof one of the participating views gets updated. Even though a map+reduce view has unique keys, we need a refcount since we havemultiple views. Whoever got to insert its value last wins.

There is a slight complication: group key values are calculated on-the-fly from the view result b-tree. So whenever a reduce call results ina new value for a b-tree node, AND that node is the upper node of asubtree that is completely part of a group key, that group key needsto be marked for recalculation.

Likewise, if deletion/addition of a b-tree node results in the removal/creation of the sole upper node of a group key subtree, that group keyneeds to be marked for removal/addition.


This is the algorithm:

- When a reducing view gets updated, and it is part of a Review DB,use the 2 paragraphs above to keep a list of group keys that needhandling- After updating the reduce() results, for each of the marked groupkeys:

 - If a group key gets removed:
   - look up doc with key=group key in review db. If exists:
     - delete doc
 - If a group key gets added:
   - look up doc with key=group key in review db. If exists:
     - set doc.value to the row value
   - else

- create doc with id=group key in string form, key=group key,value=value

 - If a group key gets updated:
   - look up doc with key=group key in review db. If exists:
     - set doc.value to the row value
   - else

- create doc with id=group key in string form, key=group key,value=value

As you can see, this is something CouchDB should do since it knowswhen it's updating group key reduction values and it knows if this wasan delete, update or addition.

View updates are done when the view is called; Review updates are doneat this time as well. Views on Review DBs are done when they are called.


Summary
=======

Review DBs are a sort of view index that CouchDB can maintain withlittle overhead. It caches group key results and allows chained map+reduce calculations using mostly existing frameworks.

I think this would be a very useful feature for CouchDB to have. Thereare regularly requests for storing view results in a database for post-processing on the mailing lists.

I'm not saying this is a trivial change but it doesn't seemtechnically impossible to me either. (unless I missed something again;this is the 5th iteration of this proposal. Anyway I know *I* wouldn'tbe able to code this :-) )


What do you think, oh dear devs?

Wout.

Proposal: Review DBs

Reply via email to