Thanks Ted, good advice as always :-).
Ted Dunning <[EMAIL PROTECTED]> wrote:
Doug pointed out that chaining reduces like this is a bad idea.
The reasoning is that reliability is severely compromised because map output
is stored locally during the sort phase, so small failures can cause serious
problems that may entail a large amount of rework.
The preferred implementation for chained reduces is to simply use multiple
map/reduce phases where the map in the later phases is just the identity
function (or a field permutation which selects a different key).
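To make that concrete, here is a toy Python sketch (not Hadoop code) of the
pattern Ted describes: two full map/reduce phases, where the second phase's
map is just a field permutation that selects a different key. The helper
names (map_reduce, identity_map) and the word-count example are my own
illustration, not anything from Hadoop's API.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One MapReduce phase: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(key, values)) for key, values in sorted(groups.items())]

def identity_map(record):
    """Identity mapper: pass the (key, value) pair through unchanged."""
    yield record

# Phase 1: a classic word count.
words = ["a", "b", "a", "c", "b", "a"]
counts = map_reduce(words,
                    lambda w: [(w, 1)],          # map: emit (word, 1)
                    lambda k, vs: sum(vs))       # reduce: sum the 1s
# counts == [("a", 3), ("b", 2), ("c", 1)]

# Phase 2: the "chained reduce", written as a second full job whose map
# permutes the fields (count becomes the key), so the reduce groups
# words by their frequency.
by_count = map_reduce(counts,
                      lambda kv: [(kv[1], kv[0])],      # map: swap key/value
                      lambda count, ws: sorted(ws))     # reduce: collect words
# by_count == [(1, ["c"]), (2, ["b"]), (3, ["a"])]
```

Note how phase 1's reduce output shrinks the data (six records down to three),
which is the compression effect discussed below.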
As I have gained more experience, I am finding it very common for the data
to shrink sharply as you move down the chain. This is because counting (or
something similar) is a very common operation, and counting compresses the
heck out of your data.
This compression means that what happens in the downstream phases just
doesn't much matter.
On 9/7/07 9:12 AM, "C G" wrote:
> I've seen some traffic where people discuss using multiple reduces, and I'd
> like to understand more about this.
>
> If you do multiple reduces, does that mean from a data flow perspective:
>
> map() -> reduce0() -> reduce1() ->...->reduceN-1() -> reduceN() ?
>
> From an implementation point of view, how do you go about setting up
> multiple reduces?
>
> Thanks for any advice or pointers to info...
> C G
>
>
> ---------------------------------
> Luggage? GPS? Comic books?
> Check out fitting gifts for grads at Yahoo! Search.