Definitely been meaning to provide a review for this. I'll try and get to it in a few hours.
On Fri, Mar 15, 2013 at 2:36 PM, Noah Slater <[email protected]> wrote: > Hey Nick, > > This looks pretty great. Is it something you need assistance with? Anything > I can do to help? > > Thanks, > > > On 22 January 2013 15:18, nicholas a. evans <[email protected]> wrote: > >> I've started hacking on an implementation of iterative map reduce that I'm >> calling "nested views". It's not nearly finished, but well overdue for >> feedback from people who know what they're doing with CouchDB and Erlang. >> :) >> >> **************************** >> My bosses are open to a short term contract with an Erlang hacking CouchDB >> committer to get this into a safely usable state ASAP. >> **************************** >> >> == Intended Usage >> >> To create a nested view, add "views" inside of a view like so: >> >> { >> "_id": "_design/ddoc", >> "language": "javascript", >> "views": { >> "foo_count": { >> "map": "function (doc) { emit(doc.foo, 1); }", >> "reduce": "_sum", >> "views": { >> "sorted_by_count": { >> "map":"function(doc) { emit(doc.value, nil); }" >> } >> } >> } >> } >> } >> >> The parent row is reduced with group_level=exact, and each reduced row is >> sent to the nested view map functions as a document that looks like so: >> >> { >> "_id": json_encode(parent_view_key), >> "key": parent_view_key, "value": reduced_value, >> } >> >> Nested views would be queried using the standard view API like so: >> >> GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args >> >> Using include_docs=true would return the reduced parent row as the doc. >> >> == Comparison to Cloudant's "dbcopy" >> >> The trade-offs between this approach and chaining to a different design doc >> or a different database are similar to the trade-offs of storing all of >> your >> views in one design doc versus a view per design doc: >> >> * simplicity of one design doc vs multiple >> * consistency within design doc index vs eventual consistency >> * high vs low latency between doc insertion and view availability >> * entire index is invalidated when one function changes >> * some views may need to be compacted more frequently than others >> >> I have some application code that could take advantage of each approach, >> and >> ultimately I'd like to see both styles supported natively by Apache >> CouchDB. >> >> == Implementation notes >> >> New record fields: #mrview{path=[], nested_id_btree=nil}. >> >> Each nested view creates a new View#mrview object which is stored directly >> on the #mrstate.views list. The view state saved in the #mrheader will >> include a second btree in parent views. View#mrview.nested_id_btree keeps >> track of ToRemove entries for incremental updates (analogous to >> #mrstate.id_btree). View#mrview.path represents the "path" of parent >> views from >> the root as a list. It will be used used (by >> couch_mrview_util:get_view/4) for >> view lookup, influences the sort order of the views in #mrst.views, >> simplifies >> include_docs, detects "safe" top-level views during the initial mapper >> index >> updater pass, and ensures uniqueness when comparing one view to another >> (e.g. >> views could share their map definitions, but have different data sources). >> If >> two different paths share identical function code, they will still result >> in two >> View indexes. >> >> In origin/master, couch_mrview_updater:update/2 starts up two linked >> processes from the parent index updater process; one for mapping, one for >> writing (reductions stored in the btree are handled by the couch_btree >> module). I've created a third process to handle the nested mapping. The >> writer >> process traverses the view tree depth-first, and should send the reduced >> rows >> (group_level=exact) to the nested mapper process. The writer process >> waits for >> the nested results before continuing, because otherwise the mrstate would >> claim >> a higher update_seq than the nested views have completed indexing and the >> index >> file would be temporarily inconsistent. >> >> My (incomplete) code is at http://github.com/nevans/couchdb (in master). >> I did >> a lot of refactoring just so I could understand the CouchDB code better >> and to >> test my knowledge of Erlang. Most of these were splitting up large >> functions, >> renaming variables, and changing recursive functions into higher order >> functions >> passed to lists:foldl (I have the same aversion to explicit looping in >> ruby when >> Enumerable#inject can do the job). I noticed that master doesn't contain a >> single merge commit(!) so obviously this would need to be cleaned up a lot >> (cosmetic refactorings could be backed out or placed in separate commits) >> if >> there is ever any interest in merging it. >> >> == Implementation progress >> >> First-pass "done" so far: >> >> * Tests! >> * First do no harm: still passing original tests. >> * Stupid basic test that creates a single simple nested view >> and tries to update it >> * Creation >> * Parse the design docs to create the nested views. >> * Updates >> * Traversal of nested views by writer process. >> * Visiting each nested view by nested mapper process. >> * although its buggy and crashes when I send in sample data. >> * Retrieval >> * nothing yet... >> >> Still incomplete in my branch (with some questions): >> >> * Tests! >> * more complicated nested views >> * queried nested views >> * queried with include_docs >> >> * Creation >> * Add the nested views to the mrheader; read them in and write them out. >> couch_mrview_util:init_state/4 >> couch_mrview_util:open_view/5 >> * Initialize the nested_id_btree (with appropriate Reduce, Less, etc). >> couch_mrview_util:open_view/5 >> * Consider nested view defs in computing the index file's MD5. >> couch_mrview_util:ddoc_to_mrst/2 >> Is this automatically accomplished by updating #mrst.views and >> #mrview? >> >> * Updates >> * Update sample data without crashing! >> * Pull in the group/reduced rows and format them for the nested mapper. >> * This keeps stumping me. >> >> * Retrieval >> * Query nested views at all. >> * Query nested views with include_docs. >> >> * Etc (does this matter?) >> * couch_mrview_util:calculate_data_size/2 >> * ??? >> >> == Questions >> >> As an Erlang newbie hacking on unfamiliar source code, I hope I haven't >> made >> extra work for myself by going with this approach vs a view changes feed >> multi-db Cloudant-style approach... Is there anything incredibly stupid >> about >> this approach? :-) >> >> Any thoughts on the intended API? >> >> I'm having a hard time querying the view in its current state from the >> internals >> of couch. I've got the partially updated View state object with its >> updated >> Btrees, which is *not* the same View and Btree that would come back if I >> were to >> use couch_mrview_util:get_view(Db, DDoc, Vname, Args). But I haven't been >> able >> to successfully hack together a version of couch_mrview:query_view that >> will do >> what I want. >> >> Can incremental updates for the most common use case (alternate sort >> indexes) be handled by something simpler than a completely generic >> iterative >> map reduce? >> >> I've branched from origin/master. If I want to put this into production >> for >> myself as soon as its "done", should I use the 1.3 branch instead? >> >> I've probably created some performance regressions in the nested view >> traversal. >> Any thoughts there? How should I test that (other than by timing my own >> datasets)? Right now, getting this done is more important for us than >> getting >> it done optimally. >> >> Is there a "view index version" number that should be incremented >> somewhere, >> so the new mrheader format doesn't break old views? Is there a good way to >> update a view file in place (e.g. see an old header and rewrite a new >> header)? >> >> Since the nested map and nested write queues never have more than one entry >> at a time, should I even bother with the overhead of couch_work_queue? >> Should I just use the plain old Erlang "send message receive response" >> pattern instead? Since it is done synchronously, should it even use a >> separate process? To avoid CPU bottlenecks, it like "merge_results" should >> probably also be in its own process? >> >> On a side not, it seems wasteful to constantly be passing source code to >> the >> view servers. Have you considered using a pattern like redis's evalsha >> scripting? First check (using the checksums for the functions) to see if >> the >> functions have been loaded on that view server. If not then load them. >> All >> other references to functions use the checksum instead. Perhaps >> Spidermonkey is >> smart enough to keep an internal function cache, but it still seems like >> non-trivial IO bandwidth could be wasted on large function bodies. >> >> >> Since you've read this far, a reminder: My bosses are open to a short term >> contract with an Erlang hacking CouchDB committer to get this into a safely >> usable state ASAP. >> >> Thanks for any help! I'll also be available in #couchdb and #couchdb-dev. >> >> Nick Evans >> > > > > -- > NS
