Such as having view definition, in the design doc, contain an array of objects, each with the map/reduce function pair attributes?
On Wed, Apr 22, 2009 at 9:48 AM, Adam Kocoloski <[email protected]> wrote: > Hi Wout, thanks for writing this up. > > One comment about the map-only views: I think you'll find that Couch has > already done a good bit of the work needed to support them, too. Couch > maintains a btree for each design doc keyed on docid that stores all the > view keys emitted by the maps over each document. When a document is > updated and then analyzed, Couch has to consult that btree, purge all the > KVs associated with the old version of the doc from each view, and then > insert the new KVs. So the tracking information correlating docids and view > keys is already available. > > You'd still be left with the problem of generating unique docids for the > documents in the Review DB, but I think that's a problem that needs to be > solved. The restriction to only MR views with no duplicate keys across > views seems too strong to me. > > With that said, I'd prefer to spend my time extending the view engine to > handle chainable MR workflows in a single shot. Especially in the simple > sort_by_value case it just seems like a cleaner way to go about things. > Cheers, > > Adam > > On Apr 22, 2009, at 8:40 AM, Wout Mertens wrote: > >> Intro >> ===== >> How do you sort by reduce value? How do you join views? How do you get >> unique view results? How do you cache group key reduces? >> >> I think that with the below proposed solution all the above and more are >> possible. The general idea is to store view results and run map/reduce on >> them. There's been some discussions about this but they went nowhere. I've >> been thinking about this issue a bit and I think it can be done. >> >> I'd like to call this feature a Review DB. >> >> Use cases >> ========= >> - Suppose you want to know what tags are most popular on your blog. Simply >> get: >> >> >> http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value >> >> Where tags_by_count is a Review DB that gets input from the tagcount view >> and then runs the sort_by_value view on it, a map() function that simply >> emits (value,key). >> >> Likewise, show pages in order of popularity, whereby user can vote up (+1) >> or down (-1): >> >> http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value >> >> - Given documents with attributes title, date and tags. You'd like to know >> the minimum value of date and a breakdown by count for tags, for every >> title. Normally you'd use 2 map+reduce views, minimum_date_by_title and >> tagcount_by_title, which you would then query separately. With a Review DB, >> you can let both views insert their results in the database and then run a >> view that combines the results in one view: >> >> >> http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data >> >> - This is not a way to run an on-the-fly map/reduce on a subset of a view, >> like if you want to find the median popularity score of restaurants with >> "Tony" in their name that are close to you. >> >> Implementation >> ============== >> A Review DB is a hidden database maintained by CouchDB with these fields: >> - _id of document is the string representation of the key >> - "key" is the key of the incoming view row (unique) >> - "value" is the value of the incoming view row >> >> I hope that this is sufficiently like a normal view that it can be stored >> as a normal view. _id is just there to make it doc-compliant, it would be >> much better if "key" were the actual key. >> >> A Review DB is defined in a design document like normal views. Each review >> is an entry in the "reviews" hash, and has a "incoming_views" array that >> lists all the views that should insert results in the review db plus the >> group level, as well as a normal "views" hash for further map/reduce of the >> review db (and perhaps another "reviews" hash for further result >> processing?). >> >> Maintaining a database of results means that results have to be updated or >> even removed when documents change. I tried to make this work (in theory) >> for map-only views, but the resulting requirements are quite messy. You >> either need to cache the previous results of a view for each document, or >> you have to have an old version of the document available to regenerate >> those results. >> >> Therefore, a Review DB only accepts results from one or more map+reduce >> views. You define beforehand what the group_level of the keys is that will >> be inserted. >> >> Furthermore, a Review DB disallows (but doesn't enforce) having 2 views >> that generate the same keys. Otherwise, refcounting would need to be used >> and while that's not difficult, I think there's limited value in allowing >> this. >> >> The Review DB needs updating every time the reduction for a group key of >> one of the participating views gets updated. Even though a map+reduce view >> has unique keys, we need a refcount since we have multiple views. Whoever >> got to insert its value last wins. >> >> There is a slight complication: group key values are calculated on-the-fly >> from the view result b-tree. So whenever a reduce call results in a new >> value for a b-tree node, AND that node is the upper node of a subtree that >> is completely part of a group key, that group key needs to be marked for >> recalculation. >> >> Likewise, if deletion/addition of a b-tree node results in the >> removal/creation of the sole upper node of a group key subtree, that group >> key needs to be marked for removal/addition. >> >> This is the algorithm: >> - When a reducing view gets updated, and it is part of a Review DB, use >> the 2 paragraphs above to keep a list of group keys that need handling >> - After updating the reduce() results, for each of the marked group keys: >> - If a group key gets removed: >> - look up doc with key=group key in review db. If exists: >> - delete doc >> - If a group key gets added: >> - look up doc with key=group key in review db. If exists: >> - set doc.value to the row value >> - else >> - create doc with id=group key in string form, key=group key, >> value=value >> - If a group key gets updated: >> - look up doc with key=group key in review db. If exists: >> - set doc.value to the row value >> - else >> - create doc with id=group key in string form, key=group key, >> value=value >> >> As you can see, this is something CouchDB should do since it knows when >> it's updating group key reduction values and it knows if this was an delete, >> update or addition. >> >> View updates are done when the view is called; Review updates are done at >> this time as well. Views on Review DBs are done when they are called. >> >> Summary >> ======= >> Review DBs are a sort of view index that CouchDB can maintain with little >> overhead. It caches group key results and allows chained map+reduce >> calculations using mostly existing frameworks. >> >> I think this would be a very useful feature for CouchDB to have. There are >> regularly requests for storing view results in a database for >> post-processing on the mailing lists. >> >> I'm not saying this is a trivial change but it doesn't seem technically >> impossible to me either. (unless I missed something again; this is the 5th >> iteration of this proposal. Anyway I know *I* wouldn't be able to code this >> :-) ) >> >> What do you think, oh dear devs? >> >> Wout. > >
