I'm outputting a small amount of secondary summary information from a map
task that I want to use in the reduce phase of the job.  This information is
keyed on a custom input split index.

Each map task outputs this summary information (less than a hundred bytes per
input split).  Note that the summary information isn't ready until the map
task completes.

Each reduce task needs to read this information (for all input splits) to
complete its task.

What is the best way to pass this information to the Reduce stage?  I'm
working in Java using cdhb2.  Ideas I've had include:

1. Output this data to MapContext.getWorkOutputPath().  However, that data
is not available anywhere in the reduce stage.
2. Output this data to "mapred.output.dir".  The problem here is that the
map task writes to this location immediately, so failed task attempts and
speculative execution could cause collisions.
3. Output this data as in (1) and then use Mapper.cleanup() to copy these
files to "mapred.output.dir".  This could work, but I'm still a little
concerned about collision/race issues since I'm not clear on when a map task
attempt becomes "the" committed task for its split (a rough sketch of a
variant is below this list).
4. Use an external system to hold this information and then call that
system from both phases.  This is basically an alternative to #3 and has the
same issues.
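
To make the collision concern concrete, here is a rough sketch of a variant
of (2)/(3) that I've been turning over: write the summary directly from
Mapper.cleanup() to a dedicated side directory (not the job output dir), with
the file named after the task attempt so speculative or failed attempts can
never collide on a filename.  The "summary.side.dir" config key and the
class/field names below are just placeholders I made up for the sketch, not
anything from my actual job.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SummaryMapper extends Mapper<LongWritable, Text, Text, Text> {

  private long recordCount = 0;  // stand-in for the real per-split summary

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    recordCount++;
    context.write(value, value);  // normal map output, unchanged
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path sideDir = new Path(conf.get("summary.side.dir"));  // placeholder key
    FileSystem fs = sideDir.getFileSystem(conf);

    // In the real job the index would come off the custom InputSplit via
    // context.getInputSplit(); the map task number is used here as a stand-in.
    int splitIndex = context.getTaskAttemptID().getTaskID().getId();

    // One file per task attempt.  Speculative/failed attempts leave extra
    // files for the same split instead of overwriting each other; the reducer
    // dedupes by split index.
    Path out = new Path(sideDir, splitIndex + "_" + context.getTaskAttemptID());

    FSDataOutputStream os = fs.create(out, false);  // fail rather than overwrite
    try {
      os.writeBytes(splitIndex + "\t" + recordCount + "\n");
    } finally {
      os.close();
    }
  }
}

The unanswered part is still which attempt ends up being "the" committed one,
but since the summary only depends on the split, any attempt's file should
carry the same numbers, so the duplicates seem harmless.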

Are there suggested approaches of how to do this?

It seems like (1) might make the most sense if there is a defined way to
stream secondary outputs from all the mappers within the Reducer.setup()
method.
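
For the read side, here's roughly what I picture the matching Reducer.setup()
looking like, under the same made-up "summary.side.dir" setting.  Since a
reduce task can't start reducing until every map has finished, all of the
side files should already be in place when setup() runs; duplicate files from
speculative attempts just collapse onto the same split index.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SummaryReducer extends Reducer<Text, Text, Text, Text> {

  // split index -> summary value gathered from the map-side files
  private final Map<Integer, Long> summaries = new HashMap<Integer, Long>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path sideDir = new Path(conf.get("summary.side.dir"));  // placeholder key
    FileSystem fs = sideDir.getFileSystem(conf);

    for (FileStatus status : fs.listStatus(sideDir)) {
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      try {
        String line = reader.readLine();
        if (line == null) {
          continue;  // e.g. an attempt killed before it wrote anything
        }
        String[] fields = line.split("\t");
        // Extra files from speculative attempts map to the same split index
        // and should hold the same values, so overwriting is harmless.
        summaries.put(Integer.valueOf(fields[0]), Long.valueOf(fields[1]));
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // ... use the per-split summaries loaded in setup() alongside the values ...
    for (Text value : values) {
      context.write(key, value);
    }
  }
}

I'm still not sure whether listing a side directory like this counts as a
supported pattern, which is really the heart of my question.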

Thanks for any ideas.

Jacques
