> Also, I notice that both Google's and Hadoop's examples of the distributed sort fail to deal
> with the fact that the result is multiple sorted files... this isn't a complete sort since
> the output files still need to be merge-sorted, don't they? To complete the algorithm,
> could the Reducer kick off a subsequent merge sort MapReduce on the result files?
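
On the merge question: below is a minimal sketch, not taken from either example's code, of how a sort job can avoid a final merge when keys are assigned to reducers by range rather than by hash. Reducer i then only sees keys smaller than those of reducer i+1, so the sorted output files, read in reducer order, already form one sorted sequence. The function names (pick_split_points, range_partition, distributed_sort) and the in-memory lists standing in for files are assumptions for illustration only; this is plain Python, not the Hadoop API.

```python
import bisect
import random


def pick_split_points(sample, num_reducers):
    """Choose num_reducers - 1 boundary keys from a sample of the input keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    # Keys below split_points[0] go to reducer 0, keys in
    # [split_points[0], split_points[1]) go to reducer 1, and so on.
    return bisect.bisect_right(split_points, key)


def distributed_sort(records, num_reducers, sample_size=100, seed=0):
    """Simulate a range-partitioned sort; returns one sorted list per reducer."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = pick_split_points(sample, num_reducers)

    # "Shuffle" phase: route every record to exactly one reducer by key range.
    buckets = [[] for _ in range(num_reducers)]
    for key in records:
        buckets[range_partition(key, split_points)].append(key)

    # "Reduce" phase: each reducer sorts only its own bucket (one output file each).
    return [sorted(bucket) for bucket in buckets]
```

Concatenating the lists returned by distributed_sort(records, 4) in reducer order gives the same result as sorting all of records at once. How evenly the work is spread depends on how well the sample reflects the key distribution.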
By the way, in Hadoop the limitation (if you want to call it that) is that each reducer copies the map outputs destined for it to the local disk(s) of the node where it is running, first for merging and then for reducing. So if you have just one reducer and plenty of map output, the node running that reducer might simply run out of disk space. Imagine sorting 10TB of data and generating one sorted output file.

Another point worth noting is that having more reducers makes the final output generation more parallel, since multiple reducers run at the same time and each operates on a smaller chunk of the map outputs.

-----Original Message-----
From: James Kennedy [mailto:[EMAIL PROTECTED]
Sent: Friday, June 22, 2007 11:09 PM
To: [email protected]
Subject: Examples of chained MapReduce?

I was wondering if anyone knows of any examples of truly chained, truly distributed MapReduce jobs. So far I've had trouble finding examples of MapReduce jobs that are kicked off by some one-time process and that in turn kick off other MapReduce jobs long after the initial driver process is dead. This would be more distributed and fault tolerant since it removes the dependency on a driver process.

I looked at the Nutch crawl code, for example, which iteratively builds up a url db using successive MapReduces up to a certain depth. But this is all done from within a for loop of a single process, even though each individual MapReduce is distributed.

Also, I notice that both Google's and Hadoop's examples of the distributed sort fail to deal with the fact that the result is multiple sorted files... this isn't a complete sort since the output files still need to be merge-sorted, don't they? To complete the algorithm, could the Reducer kick off a subsequent merge sort MapReduce on the result files? Or maybe there's something I'm not understanding...
