BTW, each key appears exactly once in the large constant dataset, and 
exactly once in each MR job's output.

I am thinking the right approach is to partition the job output and the 
large constant dataset with the same partitioner, with the number of 
partitions equal to the number of reduce tasks, so each partition goes into 
its own file.  Then make an InputFormat whose number of splits equals the 
number of reduce tasks.  Reading a split would consist of reading the 
corresponding pair of part files and stepping through both (each part file 
is already sorted by key, so this is a straightforward merge).  Seems like 
something that should already be provided by something in 
org.apache.hadoop.mapreduce.*.
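
Roughly what I have in mind is the untested sketch below.  PairedInputFormat, 
the pairedjoin.* config keys, and the trick of reusing CombineFileSplit as a 
holder for the pair of paths are all placeholders I made up; the record 
reader that actually steps through the two files is left abstract.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public abstract class PairedInputFormat<K, V> extends InputFormat<K, V> {

  // One split per partition: part file i from the job output paired with
  // part file i from the constant dataset.
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    Path jobOutput = new Path(conf.get("pairedjoin.job.output"));
    Path constant  = new Path(conf.get("pairedjoin.constant.dir"));

    // globStatus sorts results by name, so index i in one directory lines
    // up with index i in the other as long as both were written with the
    // same partitioner and the same number of reduce tasks.
    FileStatus[] left  = fs.globStatus(new Path(jobOutput, "part-*"));
    FileStatus[] right = fs.globStatus(new Path(constant,  "part-*"));

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < left.length; i++) {
      Path[] pair    = { left[i].getPath(), right[i].getPath() };
      long[] lengths = { left[i].getLen(),  right[i].getLen()  };
      // CombineFileSplit is used here only as a serializable container
      // for the two paths that make up one split.
      splits.add(new CombineFileSplit(pair, lengths));
    }
    return splits;
  }

  // createRecordReader is left abstract: the reader would open both files
  // of the pair and step through them in key order, emitting joined records.
}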

Thanks,
Mike
