> Also, I notice that both Google's and Hadoop's examples of the distributed
> sort fail to deal with the fact that the result is multiple sorted
> files... this isn't a complete sort since the output files still need to
> be merge-sorted, don't they?  To complete the algorithm, could the Reducer
> kick off a subsequent merge sort MapReduce on the result files?

By the way, in Hadoop the limitation (if you want to call it that) is that
each reducer copies the map outputs assigned to it to the local disk(s) of
the node where it is running (for merging and, later, reducing). So if you
have just one reducer and plenty of map output, the node running that
reducer might simply run out of disk space. Imagine sorting 10TB of data
into a single sorted output file.

Another point worth noting is that having more reducers makes the final
output generation more parallel, since multiple reducers run at the same
time and each operates on a smaller share of the map outputs.
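
To make the disk-space point concrete, here is a minimal Python sketch (not
Hadoop code). It assumes keys are hash-partitioned across reducers and uses
made-up record sizes; `partition` and `reducer_load` are hypothetical names
chosen only for illustration. It totals how many bytes of map output each
reducer would have to copy to its local disk.

```python
import zlib
from collections import Counter

def partition(key: str, num_reducers: int) -> int:
    """Hypothetical partitioner: stable hash of the key modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(map_outputs, num_reducers):
    """Return the bytes of map output each reducer must copy locally.

    map_outputs: one list of (key, value) pairs per map task.
    """
    load = Counter({r: 0 for r in range(num_reducers)})
    for task_output in map_outputs:
        for key, value in task_output:
            load[partition(key, num_reducers)] += len(key) + len(value)
    return dict(load)

# Toy data (assumed): 4 map tasks, 1000 records each, 10-byte keys and values.
map_outputs = [
    [(f"key-{task}-{i:04d}", "x" * 10) for i in range(1000)]
    for task in range(4)
]

one = reducer_load(map_outputs, 1)
eight = reducer_load(map_outputs, 8)
print("1 reducer :", one)
print("8 reducers:", eight)
```

With a single reducer the whole map output lands on one node; with eight,
the same total is spread out and no node holds more than its share.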

-----Original Message-----
From: James Kennedy [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 11:09 PM
To: [email protected]
Subject: Examples of chained MapReduce?

I was wondering if anyone knows of any examples of truly chained, truly
distributed MapReduce jobs.

So far I've had trouble finding examples of MapReduce jobs that are
kicked off by some one-time process and that in turn kick off other MapReduce
jobs long after the initial driver process is dead.  This would be more
distributed and fault tolerant since it removes the dependency on a driver
process.

I looked at the Nutch crawl code, for example, which iteratively builds up a
URL db using successive MapReduces up to a certain depth.  But this is all
done from within a for loop of a single process, even though each individual
MapReduce is distributed.
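
The single-driver pattern described above can be sketched roughly as follows.
This is a toy in-memory illustration in Python, not the Nutch code:
`run_mapreduce`, `expand`, `dedupe`, `crawl`, and the `LINKS` graph are all
hypothetical stand-ins. Each round's output feeds the next round, and the
whole chain lives inside one process's for loop.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for one MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key group, visiting keys in sorted order.
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Hypothetical link graph standing in for fetched pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

def expand(url):
    """Map: emit the url itself plus every url it links to."""
    yield url, 1
    for target in LINKS.get(url, []):
        yield target, 1

def dedupe(url, _values):
    """Reduce: keep each url exactly once."""
    return url

def crawl(seed_urls, depth):
    db = list(seed_urls)
    # The driver loop: every round depends on this process staying alive.
    for _ in range(depth):
        db = run_mapreduce(db, expand, dedupe)
    return db

print(crawl(["a"], 2))
```

If the process running `crawl` dies between rounds, nothing starts the next
job, which is the dependency on a driver that I'd like to avoid.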

Also, I notice that both Google's and Hadoop's examples of the distributed
sort fail to deal with the fact that the result is multiple sorted files...
this isn't a complete sort since the output files still need to be
merge-sorted, don't they?  To complete the algorithm, could the Reducer kick
off a subsequent merge sort MapReduce on the result files?
Or maybe there's something I'm not understanding...
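
For reference, the merge step I have in mind looks roughly like this in
Python. It is only a sketch: the three lists are made-up stand-ins for
reducer output files, each sorted internally but with overlapping key
ranges, and `merge_sorted_parts` is a hypothetical name.

```python
import heapq

def merge_sorted_parts(parts):
    """Lazily k-way merge already-sorted sequences into one sorted stream."""
    return heapq.merge(*parts)

# Hypothetical reducer outputs: sorted within each part, overlapping across parts.
part_0 = ["apple", "kiwi", "pear"]
part_1 = ["banana", "fig", "zucchini"]
part_2 = ["cherry", "mango"]

merged = list(merge_sorted_parts([part_0, part_1, part_2]))
print(merged)
```

As written, this merge runs in a single process, which is why I'm asking
whether it could itself be done as a follow-up MapReduce.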
