I was wondering if anyone knows of any examples of truly chained, truly
distributed MapReduce jobs.
So far I've had trouble finding examples of MapReduce jobs that are
kicked off by some one-time process and that in turn kick off other
MapReduce jobs long after the initial driver process is dead. This
would be more distributed and fault tolerant, since it removes the
dependency on a single driver process.
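
The kind of thing I'm imagining is a reducer that submits the next job
itself when it finishes, so nothing outside the cluster has to stay
alive. Here's a rough sketch against the old org.apache.hadoop.mapred
API -- the class name, paths, and job name are all made up, and the
identity classes just stand in for real logic:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ChainingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          out.collect(key, values.next());  // real reduce logic goes here
        }
      }

      // Runs once per reduce task after all records are processed.
      // With several reduce tasks this would submit the next job more
      // than once, so presumably only one task (or a DFS marker-file
      // check) should do the submitting.
      public void close() throws IOException {
        JobConf next = new JobConf(ChainingReducer.class);
        next.setJobName("next-stage");                // made-up name
        next.setMapperClass(IdentityMapper.class);    // stand-in logic
        next.setReducerClass(IdentityReducer.class);  // stand-in logic
        FileInputFormat.setInputPaths(next, new Path("stage1/out"));
        FileOutputFormat.setOutputPath(next, new Path("stage2/out"));
        // submitJob() returns immediately; the cluster runs the job
        // with no driver process left alive anywhere.
        new JobClient(next).submitJob(next);
      }
    }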
I looked at the Nutch crawl code, for example, which iteratively builds
up a URL db using successive MapReduces up to a certain depth. But this
is all done from within a for loop in a single process, even though each
individual MapReduce is distributed.
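
For reference, the Nutch-style driver loop looks roughly like this (a
sketch, again against the old mapred API; the paths are made up, the
identity classes stand in for the real crawl logic, and I'm assuming
Text/Text SequenceFiles between rounds):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class CrawlDriver {
      public static void main(String[] args) throws Exception {
        int depth = Integer.parseInt(args[0]);
        Path db = new Path("crawldb/round0");  // made-up path
        // Each round is a fully distributed MapReduce, but the
        // chaining happens here, inside one long-lived process.
        for (int i = 0; i < depth; i++) {
          JobConf job = new JobConf(CrawlDriver.class);
          job.setJobName("crawl-round-" + i);
          job.setMapperClass(IdentityMapper.class);    // stand-in for the
          job.setReducerClass(IdentityReducer.class);  // real crawl logic
          job.setInputFormat(SequenceFileInputFormat.class);
          job.setOutputFormat(SequenceFileOutputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.setInputPaths(job, db);
          Path next = new Path("crawldb/round" + (i + 1));
          FileOutputFormat.setOutputPath(job, next);
          JobClient.runJob(job);  // blocks until this round finishes
          db = next;              // this round's output feeds the next
        }
      }
    }

Each round is distributed, but if the process running this loop dies,
the whole chain stops.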
Also, I notice that both Google's and Hadoop's examples of the
distributed sort fail to deal with the fact that the result is multiple
sorted files... this isn't a complete sort, since the output files
still need to be merge-sorted, don't they? To complete the algorithm,
could the Reducer kick off a subsequent merge-sort MapReduce on the
result files?
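
Concretely, the merge pass I'm imagining is just an identity job with a
single reduce task, so the framework's own shuffle sort does the
merging -- something a Reducer could submit the same way as the close()
sketch above. A sketch, assuming the sort job wrote Text/Text
SequenceFiles (the paths are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MergeSortedOutputs {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(MergeSortedOutputs.class);
        job.setJobName("merge-sorted-outputs");
        // Identity map and reduce: the shuffle sort does all the
        // work of merging the per-reducer files by key.
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);  // one reducer => one sorted file
        FileInputFormat.setInputPaths(job, new Path("sort/out"));
        FileOutputFormat.setOutputPath(job, new Path("sort/merged"));
        JobClient.runJob(job);
      }
    }

Of course a single reduce task isn't distributed either, so maybe that
just moves the problem rather than solving it.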
Or maybe there's something I'm not understanding...