Ah, ok. I think I did read about that partition function in the Google
paper but I hadn't quite put it together yet. So a pre-pass can build a
partition function such that the final output is well sorted, with no
further MapReduce necessary; that was a bad example on my part.
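For my own reference, here's roughly what I picture such a partitioner
looking like, written against the old org.apache.hadoop.mapred API (a
sketch only; the interface signature varies across versions, and
loadSplitPoints() is a hypothetical helper for reading the split points
the sampling pre-pass would have written):

    import java.util.Arrays;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class RangePartitioner implements Partitioner {
      private WritableComparable[] splitPoints;

      public void configure(JobConf job) {
        splitPoints = loadSplitPoints(job);  // hypothetical helper
      }

      public int getPartition(WritableComparable key, Writable value,
                              int numReduceTasks) {
        // The first split point >= the key determines the partition,
        // so keys arrive at the reduces in globally increasing ranges
        // and concatenating the sorted outputs yields a total sort.
        int i = Arrays.binarySearch(splitPoints, key);
        int partition = (i < 0) ? -(i + 1) : i;
        return Math.min(partition, numReduceTasks - 1);
      }

      private WritableComparable[] loadSplitPoints(JobConf job) {
        // hypothetical: read the numReduceTasks-1 sorted split points
        // written by the sampling pre-pass job
        throw new UnsupportedOperationException("sketch only");
      }
    }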
But back to my original question... Doug suggests that dependence on a
driver process is acceptable. But has anyone needed true MapReduce
chaining or tried it successfully? Or is it generally accepted that a
multi-MapReduce algorithm should always be driven by a single process?
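To make the question concrete, the single-driver pattern I mean is just
sequential blocking calls, something like this (a rough sketch;
job configuration details elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        // First pass; runJob() blocks until the job completes (or
        // throws on failure), so this driver process must stay alive.
        JobConf sample = new JobConf(TwoPassDriver.class);
        // ... set mapper/reducer, input paths, and an intermediate
        // output dir here ...
        JobClient.runJob(sample);

        // Second pass reads the first pass's output.
        JobConf sort = new JobConf(TwoPassDriver.class);
        // ... set input to the intermediate dir and configure the
        // partitioner with the sampled split points ...
        JobClient.runJob(sort);
      }
    }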
Doug Cutting wrote:
James Kennedy wrote:
So far I've had trouble finding examples of MapReduce jobs that
are kicked off by some one-time process and that in turn kick off
other MapReduce jobs long after the initial driver process is dead.
This would be more distributed and fault-tolerant, since it removes
the dependency on a driver process.
Yes, but it wouldn't be that much more fault tolerant. The biggest
cause of failures isn't any particular node failing, but that, across
many nodes, some node fails. A driver program only fails if the
particular node running it fails. If the MTBF of a particular node is
~1 year, that's probably okay for a driver program, since driver
programs only need to run for hours or days at most. However it's a
problem if you have 1000 nodes, see 3+ failures per day on average,
and require that all nodes stay up for the duration of a job.
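(Back-of-envelope: with a per-node MTBF of ~1 year, each node fails
with probability ~1/365 on any given day, so 1000 nodes give roughly
1000/365 ≈ 2.7 expected failures per day, matching the 3+ figure.)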
Also, I notice that both Google's and Hadoop's examples of the
distributed sort fail to deal with the fact that the result is
multiple sorted files... this isn't a complete sort, since the output
files still need to be merge-sorted, don't they? To complete the
algorithm, could the Reducer kick off a subsequent merge-sort
MapReduce on the result files? Or maybe there's something I'm not
understanding...
Yes, MapReduce doesn't actually do a full sort. It produces a set of
sorted partitions. Sometimes the partition function can arrange
things so that this is in fact a full sort, but frequently it is just
a hash function (sketched after the quotes below). Google mentions
this in the original MapReduce paper:
We guarantee that within a given partition, the intermediate
key/value pairs are processed in increasing key order.
This ordering guarantee makes it easy to generate
a sorted output file per partition, which is useful when
the output file format needs to support efficient random
access lookups by key, or users of the output find it convenient
to have the data sorted. (from page 6)
and
Our partitioning function for this benchmark has built-in
knowledge of the distribution of keys. In a general
sorting program, we would add a pre-pass MapReduce
operation that would collect a sample of the keys and
use the distribution of the sampled keys to compute
split-points for the final sorting pass. (from page 9)
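For reference, the default partition function is essentially hash-mod,
roughly what Hadoop's HashPartitioner does (sketch of just the method
body):

    // Hash the key and mod by the number of reduces.  Nearby keys
    // scatter across partitions, so each reduce's output file is
    // sorted but the set of files is not globally ordered.
    public int getPartition(WritableComparable key, Writable value,
                            int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }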
Doug