I was wondering if anyone knows of any examples of truly chained, truly distributed MapReduce jobs.

So far I've had trouble finding examples of MapReduce jobs that are kicked off by some one-time process and that in turn kick off other MapReduce jobs long after the initial driver process is dead. This would be more distributed and fault-tolerant, since it removes the dependency on a driver process.
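
For example, something like the sketch below is what I'm after: a reducer that submits the follow-up job itself from close(), so nothing depends on the original driver staying alive. (This is just a rough sketch, not working code I've seen anywhere; ChainingReducer, the "next-stage" job, and the partition-zero guard are all my own invention.)

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ChainingReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      private JobConf conf;

      public void configure(JobConf job) {
        this.conf = job;
      }

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // ... normal reduce work for this stage ...
      }

      public void close() throws IOException {
        // Only one reduce task should launch the next stage, so guard
        // on this task's partition number.
        if (conf.getInt("mapred.task.partition", -1) != 0) {
          return;
        }
        JobConf next = new JobConf(conf);
        next.setJobName("next-stage");
        // ... configure mapper/reducer/input/output for the follow-up job ...
        // submitJob() returns as soon as the job is queued, so this task
        // can finish while the next job runs on the cluster.
        new JobClient(next).submitJob(next);
      }
    }

The guard matters because close() runs in every reduce task; without it you'd submit one copy of the follow-up job per task.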

I looked at the Nutch crawl code, for example, which iteratively builds up a URL db using successive MapReduce jobs up to a certain depth. But this is all done from within a for loop in a single process, even though each individual MapReduce job is distributed.
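
Roughly, the pattern I mean looks like this (a simplified sketch, not Nutch's actual code; CrawlDriver and the job names are made up):

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrawlDriver {
      public static void main(String[] args) throws IOException {
        int maxDepth = Integer.parseInt(args[0]);
        for (int depth = 0; depth < maxDepth; depth++) {
          JobConf job = new JobConf(CrawlDriver.class);
          job.setJobName("crawl-round-" + depth);
          // ... set mapper/reducer and paths for this round, feeding the
          // previous round's output back in as input ...
          JobClient.runJob(job);  // blocks until this round finishes
        }
        // If this JVM dies mid-loop, the whole chain stops with it.
      }
    }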

Also, I notice that both Google's and Hadoop's examples of the distributed sort fail to deal with the fact that the result is multiple sorted files... this isn't a complete sort, since the output files still need to be merge-sorted, don't they? To complete the algorithm, could the Reducer kick off a subsequent merge-sort MapReduce on the result files? Or maybe there's something I'm not understanding...
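
For concreteness, here is the kind of final merge I'm imagining would still be needed: a plain single-machine k-way merge of the sorted part files. (PartFileMerger and the Head helper are hypothetical names of mine, not anything from the Hadoop examples.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.PriorityQueue;

    public class PartFileMerger {

      // Pairs a reader with its current (smallest unconsumed) line.
      static class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;

        Head(BufferedReader reader) throws IOException {
          this.reader = reader;
          this.line = reader.readLine();
        }

        public int compareTo(Head other) {
          return line.compareTo(other.line);
        }
      }

      public static void main(String[] args) throws IOException {
        // One heap entry per sorted input file, e.g. part-00000 part-00001 ...
        PriorityQueue<Head> heap = new PriorityQueue<Head>();
        for (String name : args) {
          Head head = new Head(new BufferedReader(new FileReader(name)));
          if (head.line != null) {
            heap.add(head);
          } else {
            head.reader.close();
          }
        }
        while (!heap.isEmpty()) {
          Head head = heap.poll();             // file holding the smallest line
          System.out.println(head.line);
          head.line = head.reader.readLine();  // advance that file
          if (head.line != null) {
            heap.add(head);                    // re-insert if not exhausted
          } else {
            head.reader.close();
          }
        }
      }
    }

It keeps one line from each file in a heap, so it's a single pass over all the parts... which is exactly the step I don't see in the example sorts.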
