RE: Reduce phase only on one node?

John Butler Sat, 24 Jul 2010 06:21:02 -0700

Thank you.
-John


________________________________
> Date: Fri, 23 Jul 2010 17:14:51 -0700
> Subject: Re: Reduce phase only on one node?
> From: [email protected]
> To: [email protected]
> CC: [email protected]; [email protected]
>
> Currently, reduce functions run on the coordinating node. The coordinating 
> node runs 2 processes per reduce phase to achieve some parallelism. There are 
> two features requests open to improve the implementation:
>
>
> Allow users to toggle the number of processes used during reduce
> https://issues.basho.com/148
>
> Distribute reduce phases across the cluster
>
> https://issues.basho.com/149
>
> There is no target release set for either issue but you should be able to add 
> yourself to the CC list in Bugzilla to receive notifications.
>
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> [email protected]
>
>
>
> On Fri, Jul 23, 2010 at 2:34 PM, John D. Rowell> wrote:
>
> +1 to this, my understanding is that you can use the same reduce funcion to 
> re-reduce a stream of data and still get the same results. Is this what 
> actually happens in Riak internally (i.e. the coordinating node only 
> re-reduces each node's reduce) or does the reduce function only run on the 
> coordinating node, which receives the raw map data?
>
>
>
> 2010/7/23 John Butler>
>
>
>
>
>
> Hello,
>
> I went through the Riak Fast Track and overall I'm excited about the 
> possibilities of Riak. However, I did see one thing that concerns me in 
> regard to MapReduce jobs. From the Riak docs and this page here: 
> http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
>  it seems to suggest that while Map jobs are performed in parallel Reduce 
> jobs are not.
>
>
>
> Wouldn't be better if each node performed the reduce function on it's own set 
> of data before sending it to the main coordinator? Imagine if you are mapping 
> over a high number of objects (millions) and are trying to aggregate the data 
> grouped by some demographic (such as state). If all the mappers send the data 
> to coordinating node to be reduced then you are going to send millions of 
> items from each data node all to one coordinating node. If instead each node 
> called reduce first, then sent it to the coordinating node, then each node 
> would be at most sending over 50 objects (for 50 states). Not only would this 
> greatly reduce network traffic, it would be able to execute in parallel 
> greatly improving performance. Since all reduce jobs are communicative, 
> associative and idempotent I see no reason why this couldn't happen (of 
> course I'm not familiar with the Riak internals to strongly assert this).
>
>
>
> Maybe I misunderstood and this is already happening. If it's not happening, 
> why and is that on the roadmap to be added later?
>
> -John
>
> _________________________________________________________________
>
> Hotmail is redefining busy with tools for the New Busy. Get more from your 
> inbox.
>
> http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2
>
>
>
> _______________________________________________
>
> riak-users mailing list
>
> [email protected]
>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
>
> _______________________________________________
>
> riak-users mailing list
>
> [email protected]
>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
                                          
_________________________________________________________________
Hotmail is redefining busy with tools for the New Busy. Get more from your 
inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

RE: Reduce phase only on one node?

Reply via email to