Thanks Ted, good advice as always :-).

Ted Dunning <[EMAIL PROTECTED]> wrote:  
Doug pointed out that chaining reduces like this is a bad idea.

The reasoning is that reliability is severely compromised because map output
is stored locally during the sort phase, so small errors can cause serious
problems that entail a large amount of rework.

The preferred implementation for chained reduces is to simply use multiple
map/reduce phases where the map in the later phases is just the identity
function (or a field permutation which selects a different key).
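To make the idea concrete, here is a minimal in-memory sketch in plain Python (not the actual Hadoop API; all function names here are made up for illustration). Phase 1 is a word count; phase 2's map is just a field permutation that makes the count the key, exactly the kind of trivial map the later phases end up with:

```python
from collections import defaultdict

def run_phase(records, map_fn, reduce_fn):
    """Simulate one map/reduce phase over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):        # map step
            groups[k].append(v)
    out = []
    for k in sorted(groups):                   # shuffle/sort step
        out.extend(reduce_fn(k, groups[k]))    # reduce step
    return out

# Phase 1: word count.
def count_map(_, line):
    return [(word, 1) for word in line.split()]

def sum_reduce(word, counts):
    return [(word, sum(counts))]

# Phase 2: the map is just a field permutation selecting a different key
# (swap (word, count) so the count becomes the key); the reduce then
# groups words by how often they occurred.
def swap_map(word, count):
    return [(count, word)]

def collect_reduce(count, words):
    return [(count, sorted(words))]

lines = [(0, "a b a"), (1, "b c")]
counts = run_phase(lines, count_map, sum_reduce)
# counts == [('a', 2), ('b', 2), ('c', 1)]
by_count = run_phase(counts, swap_map, collect_reduce)
# by_count == [(1, ['c']), (2, ['a', 'b'])]
```

In real Hadoop you would run each phase as its own job, with later jobs using an identity (or permutation) mapper over the previous job's output files; the sketch above only shows the data flow.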

As I have gained more experience, I am finding that it is very, very common
to see a strong decrease in data size as you move down the chain. This is
because counting (or something similar) is a very common operation, and
counting compresses the heck out of your data.

This compression means that what happens in the downstream phases just
doesn't much matter.

On 9/7/07 9:12 AM, "C G" wrote:

> I've seen some traffic where people discuss using multiple reduces, and I'd
> like to understand more about this.
> 
> If you do multiple reduces, does that mean from a data flow perspective:
> 
> map() -> reduce0() -> reduce1() ->...->reduceN-1() -> reduceN() ?
> 
> From an implementation point of view, how do you go about setting up
> multiple reduces?
> 
> Thanks for any advice or pointers to info...
> C G
> 
> 



       