Ah, ok. I think I did read about that partition function in the Google
paper but I hadn't quite put it together yet. So a pre-pass can build a
partition function such that the final output is well sorted, with no
further MapReduce necessary; that was a bad example on my part.
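For my own reference, here's roughly what I picture such a partitioner
looking like, written against the old org.apache.hadoop.mapred API (a
sketch only; the interface signature varies across versions, and
loadSplitPoints() is a hypothetical helper for reading the split points
the sampling pre-pass would have written):

    import java.util.Arrays;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class RangePartitioner implements Partitioner {
      private WritableComparable[] splitPoints;

      public void configure(JobConf job) {
        splitPoints = loadSplitPoints(job);  // hypothetical helper
      }

      public int getPartition(WritableComparable key, Writable value,
                              int numReduceTasks) {
        // The first split point >= the key determines the partition,
        // so keys arrive at the reduces in globally increasing ranges
        // and concatenating the sorted outputs yields a total sort.
        int i = Arrays.binarySearch(splitPoints, key);
        int partition = (i < 0) ? -(i + 1) : i;
        return Math.min(partition, numReduceTasks - 1);
      }

      private WritableComparable[] loadSplitPoints(JobConf job) {
        // hypothetical: read the numReduceTasks-1 sorted split points
        // written by the sampling pre-pass job
        throw new UnsupportedOperationException("sketch only");
      }
    }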
But back to my original question... Doug suggests that dependence on a
driver process is acceptable. But has anyone needed true MapReduce
chaining or tried it successfully? Or is it generally accepted that a
multi-MapReduce algorithm should always be driven by a single process?
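To make the question concrete, the single-driver pattern I mean is just
sequential blocking calls, something like this (a rough sketch;
job configuration details elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        // First pass; runJob() blocks until the job completes (or
        // throws on failure), so this driver process must stay alive.
        JobConf sample = new JobConf(TwoPassDriver.class);
        // ... set mapper/reducer, input paths, and an intermediate
        // output dir here ...
        JobClient.runJob(sample);

        // Second pass reads the first pass's output.
        JobConf sort = new JobConf(TwoPassDriver.class);
        // ... set input to the intermediate dir and configure the
        // partitioner with the sampled split points ...
        JobClient.runJob(sort);
      }
    }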
Doug Cutting wrote:
James Kennedy wrote:
So far I've had trouble finding examples of MapReduce jobs that
are kicked off by some one-time process and that in turn kick off
other MapReduce jobs long after the initial driver process is dead.
This would be more distributed and fault-tolerant, since it removes
the dependency on a driver process.
Yes, but it wouldn't be that much more fault tolerant. The biggest
cause of failures isn't any particular node failing, but that, across
many nodes, some node fails. A driver program only fails if the
particular node running it fails. If the MTBF of a particular node is
~1 year, that's probably okay for a driver program, since driver
programs only need to run for hours or days at most. However it's a
problem if you have 1000 nodes, see 3+ failures per day on average,
and require that all nodes stay up for the duration of a job.
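(Back-of-envelope: with a per-node MTBF of ~1 year, each node fails
with probability ~1/365 on any given day, so 1000 nodes give roughly
1000/365 ≈ 2.7 expected failures per day, matching the 3+ figure.)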
Also, I notice that both Google's and Hadoop's examples of the
distributed sort fail to deal with the fact that the result is
multiple sorted files... this isn't a complete sort, since the output
files still need to be merge-sorted, don't they? To complete the
algorithm, could the Reducer kick off a subsequent merge-sort
MapReduce on the result files? Or maybe there's something I'm not
understanding...
Yes, MapReduce doesn't actually do a full sort. It produces a set of
sorted partitions. Sometimes the partition function can arrange
things so that this is in fact a full sort, but frequently it is just
a hash function (sketched after the quotes below). Google mentions
this in the original MapReduce paper:
We guarantee that within a given partition, the intermediate
key/value pairs are processed in increasing key order.
This ordering guarantee makes it easy to generate
a sorted output file per partition, which is useful when
the output file format needs to support efficient random
access lookups by key, or users of the output find it convenient
to have the data sorted. (from page 6)
and
Our partitioning function for this benchmark has built-in
knowledge of the distribution of keys. In a general
sorting program, we would add a pre-pass MapReduce
operation that would collect a sample of the keys and
use the distribution of the sampled keys to compute
split-points for the final sorting pass. (from page 9)
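For reference, the default partition function is essentially hash-mod,
roughly what Hadoop's HashPartitioner does (sketch of just the method
body):

    // Hash the key and mod by the number of reduces.  Nearby keys
    // scatter across partitions, so each reduce's output file is
    // sorted but the set of files is not globally ordered.
    public int getPartition(WritableComparable key, Writable value,
                            int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }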
Doug