Re: Examples of chained MapReduce?

Doug Cutting Fri, 22 Jun 2007 11:52:01 -0700

James Kennedy wrote:

So far what I've had trouble finding examples of MapReduce jobs that arekicked-off by some one time process that in turn kick off otherMapReduce jobs long after the initial driver process is dead. Thiswould be more distributed and fault tolerant since it removes dependencyon a driver process.

Yes, but it wouldn't be that much more fault tolerant. The biggestcause of failures isn't particular nodes failing, but that some nodesfail. A driver program only fails if the particular node running itfails. If the MTBF of a particular node is ~1 year, that's probablyokay for a driver program, since driver programs only need to run forhours or days at the most. However it's a problem if you have 1000nodes, and see 3+ failures on average per day, and you require that allnodes stay up for the duration of a job.

Also, I notice that both Google and Hadoop's example of the distributedsort fails to deal with the fact that the result is multiple sortedfiles... this isn't a complete sort since the output files still need tobe merge-sorted don't they? To complete the algorithm, could theReducer kick of a subsequent merge sort MapReduce on the result files?Or maybe there's something I'm not understanding...

Yes, MapReduce doesn't actually do a full sort. It produces a set ofsorted partitions. Sometimes the partition function can arrange thingsso that this is in fact a full sort, but frequently it is just a hashfunction. Google mentions this in the original MapReduce paper:


   We guarantee that within a given partition, the intermediate
   key/value pairs are processed in increasing key order.
   This ordering guarantee makes it easy to generate
   a sorted output file per partition, which is useful when
   the output file format needs to support efficient random
   access lookups by key, or users of the output find it convenient
   to have the data sorted. (from page 6)

and

   Our partitioning function for this benchmark has builtin
   knowledge of the distribution of keys. In a general
   sorting program, we would add a pre-pass MapReduce
   operation that would collect a sample of the keys and
   use the distribution of the sampled keys to compute splitpoints
   for the final sorting pass. (from page 9)

Doug

Re: Examples of chained MapReduce?

Reply via email to