James Kennedy wrote:
So far what I've had trouble finding examples of MapReduce jobs that are kicked-off by some one time process that in turn kick off other MapReduce jobs long after the initial driver process is dead. This would be more distributed and fault tolerant since it removes dependency on a driver process.
Yes, but it wouldn't be that much more fault tolerant. The biggest cause of failures isn't particular nodes failing, but that some nodes fail. A driver program only fails if the particular node running it fails. If the MTBF of a particular node is ~1 year, that's probably okay for a driver program, since driver programs only need to run for hours or days at the most. However it's a problem if you have 1000 nodes, and see 3+ failures on average per day, and you require that all nodes stay up for the duration of a job.
Also, I notice that both Google and Hadoop's example of the distributed sort fails to deal with the fact that the result is multiple sorted files... this isn't a complete sort since the output files still need to be merge-sorted don't they? To complete the algorithm, could the Reducer kick of a subsequent merge sort MapReduce on the result files? Or maybe there's something I'm not understanding...
Yes, MapReduce doesn't actually do a full sort. It produces a set of sorted partitions. Sometimes the partition function can arrange things so that this is in fact a full sort, but frequently it is just a hash function. Google mentions this in the original MapReduce paper:
We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted. (from page 6) and Our partitioning function for this benchmark has builtin knowledge of the distribution of keys. In a general sorting program, we would add a pre-pass MapReduce operation that would collect a sample of the keys and use the distribution of the sampled keys to compute splitpoints for the final sorting pass. (from page 9) Doug
