If these assumptions are correct: 0) Each map outputs one result, a few hundred bytes 1) The map output is deterministic, given an input split index 2) Every reducer must see the result from every map
Then just output the result N times, where N is the number of reducers, using a custom Partitioner that assigns the result to (records_seen++ % N), where records_seen is an int field on the partitioner. If (1) does not hold, then write the first stage as job with a single (optional) reduce, and the second stage as a map-only job processing the result. -C On Sun, Feb 13, 2011 at 12:18 PM, Jacques <whs...@gmail.com> wrote: > I'm outputting a small amount of secondary summary information from a map > task that I want to use in the reduce phase of the job. This information is > keyed on a custom input split index. > > Each map task outputs this summary information (less than hundred bytes per > input task). Note that the summary information isn't ready until the > completion of the map task. > > Each reduce task needs to read this information (for all input splits) to > complete its task. > > What is the best way to pass this information to the Reduce stage? I'm > working on java using cdhb2. Ideas I had include: > > 1. Output this data to MapContext.getWorkOutputPath(). However, that data > is not available anywhere in the reduce stage. > 2. Output this data to "mapred.output.dir". The problem here is that the > map task writes immediately to this so failed jobs and speculative execution > could cause collision issues. > 3. Output this data as in (1) and then use Mapper.cleanup() to copy these > files to "mapred.output.dir". Could work but I'm still a little concerned > about collision/race issues as I'm not clear about when a Map task becomes > "the" committed map task for that split. > 4. Use an external system to hold this information and then just call that > system from both phases. This is basically an alternative of #3 and has the > same issues. > > Are there suggested approaches of how to do this? > > It seems like (1) might make the most sense if there is a defined way to > stream secondary outputs from all the mappers within the Reduce.setup() > method. > > Thanks for any ideas. > > Jacques >