On 05/08/2008, at 3:49 PM, Chris Anderson wrote:
I think the missing link here is the ability to "remap" map and map/reduce results. In Hadoop-style map/reduce, the output of a single map will often be remapped in different ways for different purposes. Being able to share the intermediate results among further reprocessing is helpful, and often people will chain long stretches of map reduce processing.
Well, that would be absolutely fantastic if it came to pass. I didn't really think it was on the roadmap anytime soon though.
The challenge for the CouchDB programming model for supporting chained map/reduces is the cache-expiry issue. How can we tell which index entries to sweep when a document is changed or deleted, when that index is itself generated by running map/reduce over another index? I tell myself that the bookkeeping is possible, but it sure sounds like a big job.
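To make the bookkeeping question concrete, here is a minimal sketch of the provenance tracking it seems to imply. This is not how CouchDB implements anything; the `ChainedIndex` class and its method names are hypothetical, invented purely for illustration. It assumes each first-level key remembers which documents emitted it, and each second-level entry remembers which first-level keys it was built from.

```python
from collections import defaultdict


class ChainedIndex:
    """Hypothetical provenance tracker for a two-stage map/reduce chain."""

    def __init__(self):
        # doc id -> first-level index keys that document emitted
        self.doc_to_keys = defaultdict(set)
        # first-level key -> second-level keys derived from it
        self.key_to_derived = defaultdict(set)

    def record_map(self, doc_id, key):
        """Note that mapping doc_id produced a first-level key."""
        self.doc_to_keys[doc_id].add(key)

    def record_remap(self, source_key, derived_key):
        """Note that a second-level entry was built from source_key."""
        self.key_to_derived[source_key].add(derived_key)

    def stale_entries(self, doc_id):
        """Return (first_level, second_level) keys to sweep when doc_id
        is changed or deleted."""
        first = set(self.doc_to_keys.get(doc_id, ()))
        second = set()
        for key in first:
            second |= self.key_to_derived.get(key, set())
        return first, second


idx = ChainedIndex()
idx.record_map("doc1", "tag:a")
idx.record_map("doc1", "tag:b")
idx.record_map("doc2", "tag:b")
idx.record_remap("tag:a", "count:low")
idx.record_remap("tag:b", "count:high")
first, second = idx.stale_entries("doc1")
```

Even in this toy form, the extra state grows with every emitted row at every stage, which is presumably where the "big job" comes from.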
Hm. At the risk of exposing myself (accurately) as someone who has no grasp whatsoever of the complexities of such things - could a similar approach to the current _rev system be used?
Perhaps you could have two levels of revisions - one for the view as a whole, which changes whenever anything in the view changes. That would signal to the re-reducing view that it needs to go and look at the source view again.
And then a second level of revisions could be on the individual key "row" output. The re-reduce could then just look at the rows that changed - it would simply drop any rows it holds whose revs no longer appear in the new listing, and import any rows whose revs it doesn't already have - that would handle additions/removals as well.
I'm probably oversimplifying things? Basically just trying to think of "the simplest thing that could possibly work" ...
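A minimal sketch of the two-level idea, in Python for brevity. Nothing here is an existing CouchDB feature: the view-level rev, the per-row revs, and the `sync_rows` function are all assumptions made up to illustrate the comparison. Rows are modelled as a dict of key -> (row_rev, value).

```python
def sync_rows(cached_view_rev, cached_rows, view_rev, rows):
    """Compare a cached copy of a view against its current state.

    Returns (changed_keys, removed_keys, new_cache). Both row arguments
    are dicts of key -> (row_rev, value); all revs are hypothetical.
    """
    # Level 1: if the whole-view rev is unchanged, nothing needs redoing.
    if cached_view_rev == view_rev:
        return set(), set(), cached_rows

    # Level 2: diff the per-row revs.
    removed = set(cached_rows) - set(rows)
    changed = {
        key
        for key, (rev, _value) in rows.items()
        if key not in cached_rows or cached_rows[key][0] != rev
    }
    return changed, removed, dict(rows)


old = {"a": ("1", 10), "b": ("1", 20), "c": ("1", 30)}
new = {"a": ("1", 10), "b": ("2", 25), "d": ("1", 40)}
changed, removed, cache = sync_rows("v1", old, "v2", new)
```

Here the re-reduce would only need to reprocess `changed` and discard `removed`. The sketch still reads the full row listing whenever the view rev differs, so it saves recomputation rather than the listing itself.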
I have a prototype of remapping (with no cache-awareness) in CouchRest's git repo http://github.com/jchris/couchrest/tree/master/utils/remap.rb
Thanks - I'd been reading that anyway after noticing the bump to 0.9.0. Looks great, will try it out!
You're making sense, but I also wouldn't mind code examples :)
Sure, if you can stomach my awful code ... http://friendpaste.com/DYsves9s
In that (unedited, confused, messy) example I create and utilise two methods of getting at the membership data. The first is the one I discussed, i.e. caching it all in the Membership class. The second is to then place those caches in the "remote" record itself. Well, basically I just try lots of things, and it's all there if you can stand huge messes of experimentation : )
If you want to run it, you'll need edge DataMapper - as in, from the last 24 hours. This is probably the easiest way to get it:
http://datamapper.org/articles/stunningly_easy_way_to_live_on_the_edge.html
Hope that's useful for someone.
Sho
-- Chris Anderson http://jchris.mfdz.com
