+1 to this, my understanding is that you can use the same reduce funcion to re-reduce a stream of data and still get the same results. Is this what actually happens in Riak internally (i.e. the coordinating node only re-reduces each node's reduce) or does the reduce function only run on the coordinating node, which receives the raw map data?
2010/7/23 John Butler <[email protected]> > > Hello, > I went through the Riak Fast Track and overall I'm excited about the > possibilities of Riak. However, I did see one thing that concerns me in > regard to MapReduce jobs. From the Riak docs and this page here: > http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/it > seems to suggest that while Map jobs are performed in parallel Reduce > jobs are not. > Wouldn't be better if each node performed the reduce function on it's own > set of data before sending it to the main coordinator? Imagine if you are > mapping over a high number of objects (millions) and are trying to aggregate > the data grouped by some demographic (such as state). If all the mappers > send the data to coordinating node to be reduced then you are going to send > millions of items from each data node all to one coordinating node. If > instead each node called reduce first, then sent it to the coordinating > node, then each node would be at most sending over 50 objects (for 50 > states). Not only would this greatly reduce network traffic, it would be > able to execute in parallel greatly improving performance. Since all reduce > jobs are communicative, associative and idempotent I see no reason why this > couldn't happen (of course I'm not familiar with the Riak internals to > strongly assert this). > Maybe I misunderstood and this is already happening. If it's not happening, > why and is that on the roadmap to be added later? > -John > _________________________________________________________________ > Hotmail is redefining busy with tools for the New Busy. Get more from your > inbox. > > http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2 > _______________________________________________ > riak-users mailing list > [email protected] > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
