BTW, each key appears exactly once in the large constant dataset, and 
exactly once in each MR job's output.

I am thinking the right approach is to partition the job output and the 
large constant dataset with the same partitioner, with the number of 
partitions equal to the number of reduce tasks, so each partition goes into 
its own file.  Then make an InputFormat whose number of splits equals the 
number of reduce tasks.  Reading a split would consist of reading the 
corresponding pair of part files and stepping through both (each part file 
is already sorted by key, so this is a straightforward merge).  Seems like 
something that should already be provided by something in 
org.apache.hadoop.mapreduce.*.
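
Roughly what I have in mind is the untested sketch below.  PairedInputFormat, 
the pairedjoin.* config keys, and the trick of reusing CombineFileSplit as a 
holder for the pair of paths are all placeholders I made up; the record 
reader that actually steps through the two files is left abstract.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public abstract class PairedInputFormat<K, V> extends InputFormat<K, V> {

  // One split per partition: part file i from the job output paired with
  // part file i from the constant dataset.
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    Path jobOutput = new Path(conf.get("pairedjoin.job.output"));
    Path constant  = new Path(conf.get("pairedjoin.constant.dir"));

    // globStatus sorts results by name, so index i in one directory lines
    // up with index i in the other as long as both were written with the
    // same partitioner and the same number of reduce tasks.
    FileStatus[] left  = fs.globStatus(new Path(jobOutput, "part-*"));
    FileStatus[] right = fs.globStatus(new Path(constant,  "part-*"));

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < left.length; i++) {
      Path[] pair    = { left[i].getPath(), right[i].getPath() };
      long[] lengths = { left[i].getLen(),  right[i].getLen()  };
      // CombineFileSplit is used here only as a serializable container
      // for the two paths that make up one split.
      splits.add(new CombineFileSplit(pair, lengths));
    }
    return splits;
  }

  // createRecordReader is left abstract: the reader would open both files
  // of the pair and step through them in key order, emitting joined records.
}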

Thanks,
Mike
